1. About the dataset
2. Preparing data for analysis - importing libraries, reading data...
3. Univariate analysis
4. Pre-processing data
5. Vectorizing all features - preparing data for classification and modelling
6. Visualizing data using t-SNE
7. Classification & Modelling Using k-NN
7.1 Classification using k-NN (simple cross validation)
7.1.1 Classification using k-NN (simple cross validation) on imbalanced data
7.1.2 Classification using k-NN (simple cross validation) on balanced data
7.2 Classification using k-NN (k-fold cross validation)
7.2.1 Classification using k-NN (k-fold cross validation) on imbalanced data
7.2.2 Classification using k-NN (k-fold cross validation) on balanced data
7.3 Classification using k-NN (k-fold cross validation & feature selection)
7.3.1 Classification using k-NN (k-fold cross validation & feature selection) on imbalanced data
7.3.2 Classification using k-NN (k-fold cross validation & feature selection) on balanced data
7.4 Results of analysis using k-NN
7.5 Conclusions of analysis using k-NN
Founded in 2000 by a high school teacher in the Bronx, DonorsChoose.org empowers public school teachers from across the country to request much-needed materials and experiences for their students. At any given time, there are thousands of classroom requests that can be brought to life with a gift of any amount.
DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it's approved to be posted on the DonorsChoose.org website.
Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve.
The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.
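Sections 7.x below tackle this with k-NN classifiers trained on vectorized proposal text. As a rough preview of that idea (a minimal sketch, not the competition pipeline; the texts and labels below are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Made-up proposal snippets and hypothetical approval flags
texts = ["students need new science lab supplies",
         "requesting tablets for classroom reading",
         "science equipment for hands on experiments",
         "books and reading materials for students"]
labels = [1, 0, 1, 0]

# Vectorize the text, then fit a k-NN classifier on the resulting vectors
vectors = TfidfVectorizer().fit_transform(texts)
model = KNeighborsClassifier(n_neighbors=1).fit(vectors, labels)
print(model.score(vectors, labels))  # 1-NN scores perfectly on its own training set
```

The notebook does the same thing at scale, with proper train/test splits and cross-validation to choose k.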
The train.csv data set provided by DonorsChoose contains the following features:
| Feature | Description |
|---|---|
| project_id | A unique identifier for the proposed project. Example: p036502 |
| project_title | Title of the project. |
| project_grade_category | Grade level of students for which the project is targeted. One of an enumerated set of values. |
| project_subject_categories | One or more (comma-separated) subject categories for the project, from an enumerated list of values. |
| school_state | State where school is located (two-letter U.S. postal code). Example: WY |
| project_subject_subcategories | One or more (comma-separated) subject subcategories for the project. |
| project_resource_summary | An explanation of the resources needed for the project. |
| project_essay_1 | First application essay* |
| project_essay_2 | Second application essay* |
| project_essay_3 | Third application essay* |
| project_essay_4 | Fourth application essay* |
| project_submitted_datetime | Datetime when project application was submitted. Example: 2016-04-28 12:43:56.245 |
| teacher_id | A unique identifier for the teacher of the proposed project. Example: bdf8baa8fedef6bfeec7ae4ff1c15c56 |
| teacher_prefix | Teacher's title. One of an enumerated set of values. |
| teacher_number_of_previously_posted_projects | Number of project applications previously submitted by the same teacher. Example: 2 |
* See the section Notes on the Essay Data for more details about these features.
Additionally, the resources.csv data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:
| Feature | Description |
|---|---|
| id | A project_id value from the train.csv file. Example: p036502 |
| description | Description of the resource. Example: Tenor Saxophone Reeds, Box of 25 |
| quantity | Quantity of the resource required. Example: 3 |
| price | Price of the resource required. Example: 9.95 |
Note: Many projects require multiple resources. The id value corresponds to a project_id in train.csv, so it can be used as a key to retrieve all the resources a project needs.
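With made-up rows standing in for the two files, the per-project resource totals can be attached like this (the notebook performs the same groupby-and-merge on the real data later):

```python
import pandas as pd

# Toy stand-ins for train.csv and resources.csv (values are illustrative only)
projects = pd.DataFrame({'id': ['p001', 'p002'],
                         'project_is_approved': [1, 0]})
resources = pd.DataFrame({'id': ['p001', 'p001', 'p002'],
                          'quantity': [3, 1, 2],
                          'price': [9.95, 24.99, 4.50]})

# Sum up all resources per project, then attach the totals to each project row
totals = resources.groupby('id').agg({'price': 'sum', 'quantity': 'sum'}).reset_index()
merged = pd.merge(projects, totals, on='id', how='left')
print(merged)
```

A left merge keeps every project even if it somehow has no resource rows; such projects would get NaN totals.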
The data set contains the following label (the value you will attempt to predict):
| Label | Description |
|---|---|
| project_is_approved | A binary flag indicating whether DonorsChoose approved the project. A value of 0 indicates the project was not approved; a value of 1 indicates it was approved. |
# numpy for easy numerical computations
import numpy as np
# pandas for dataframes and filterings
import pandas as pd
# sqlite3 library for performing operations on sqlite file
import sqlite3
# matplotlib for plotting graphs
import matplotlib.pyplot as plt
# seaborn library for easy plotting
import seaborn as sbrn
# warnings library for specific settings
import warnings
# re module for regular-expression operations
import re
# For loading precomputed models
import pickle
# tqdm for tracking progress of loops
from tqdm import tqdm_notebook as tqdm
# For creating dictionary of words
from collections import Counter
# For creating BagOfWords Model
from sklearn.feature_extraction.text import CountVectorizer
# For creating TfidfModel
from sklearn.feature_extraction.text import TfidfVectorizer
# For standardizing values
from sklearn.preprocessing import StandardScaler
# For stacking sparse matrices horizontally (side by side, adding columns)
from scipy.sparse import hstack
# For stacking sparse matrices vertically (one on top of another, adding rows)
from scipy.sparse import vstack
# For calculating TSNE values
from sklearn.manifold import TSNE
# For calculating the accuracy score on cross-validation data
from sklearn.metrics import accuracy_score
# For performing k-fold cross validation
from sklearn.model_selection import cross_val_score
# For splitting the data set into train and test sets
from sklearn.model_selection import train_test_split
# KNeighbors classifier for classification
from sklearn.neighbors import KNeighborsClassifier
# For creating samples for making dataset balanced
from sklearn.utils import resample
# For shuffling the dataframes
from sklearn.utils import shuffle
# For calculating roc_curve parameters
from sklearn.metrics import roc_curve
# For calculating auc value
from sklearn.metrics import auc
# For displaying results in table format
from prettytable import PrettyTable
# For generating confusion matrix
from sklearn.metrics import confusion_matrix
# For selecting most useful features
from sklearn.feature_selection import SelectKBest, f_classif
warnings.filterwarnings('ignore')
projectsData = pd.read_csv('train_data.csv');
resourcesData = pd.read_csv('resources.csv');
projectsData.head(3)
projectsData.tail(3)
resourcesData.head(3)
resourcesData.tail(3)
def equalsBorder(numberOfEqualSigns):
    """
    Prints a separator line of the given number of equal signs.
    """
    print("=" * numberOfEqualSigns);
# Citation link: https://stackoverflow.com/questions/8924173/how-do-i-print-bold-text-in-python
class color:
PURPLE = '\033[95m'
CYAN = '\033[96m'
DARKCYAN = '\033[36m'
BLUE = '\033[94m'
GREEN = '\033[92m'
YELLOW = '\033[93m'
RED = '\033[91m'
BOLD = '\033[1m'
UNDERLINE = '\033[4m'
END = '\033[0m'
def printStyle(text, style):
"This function prints text with the style passed to it"
print(style + text + color.END);
printStyle("Number of data points in projects data: {}".format(projectsData.shape[0]), color.BOLD);
printStyle("Number of attributes in projects data: {}".format(projectsData.shape[1]), color.BOLD);
equalsBorder(60);
printStyle("Number of data points in resources data: {}".format(resourcesData.shape[0]), color.BOLD);
printStyle("Number of attributes in resources data: {}".format(resourcesData.shape[1]), color.BOLD);
approvedProjects = projectsData[projectsData.project_is_approved == 1].shape[0];
unApprovedProjects = projectsData[projectsData.project_is_approved == 0].shape[0];
totalProjects = projectsData.shape[0];
print("Number of projects approved for funding: {} ({:.2f}%)".format(approvedProjects, (approvedProjects / totalProjects) * 100));
print("Number of projects not approved for funding: {} ({:.2f}%)".format(unApprovedProjects, (unApprovedProjects / totalProjects) * 100));
# Pie chart representation
# Citation: https://matplotlib.org/gallery/pie_and_polar_charts/pie_features.html
labels = ["Approved Projects", "UnApproved Projects"];
explode = (0, 0.1);
sizes = [approvedProjects, unApprovedProjects];
figure, ax = plt.subplots();
ax.pie(sizes, labels = labels, explode = explode, autopct = '%1.1f%%', shadow = True, startangle = 90);
ax.axis('equal');
plt.rcParams['figure.figsize'] = (7, 7);
plt.show();
groupedByStatesData = pd.DataFrame(projectsData.groupby(['school_state'])['project_is_approved'].apply(np.mean)).reset_index();
# The aggregated column is the mean of the binary flag, i.e. the approval rate per state
groupedByStatesData.columns = ['state_code', 'approval_rate'];
groupedByStatesData = groupedByStatesData.sort_values(by = ['approval_rate'], ascending = True);
printStyle("5 States with lowest percentage of project approvals:", color.BOLD);
equalsBorder(60);
groupedByStatesData.head(5)
printStyle("5 states with highest percentage of project approvals: ", color.BOLD);
equalsBorder(60);
groupedByStatesData.tail(5).iloc[::-1]
def univariateBarPlots(data, col1, col2 = 'project_is_approved', orientation = 'vertical', plot = True):
groupedData = data.groupby(col1);
# Count number of zeros in dataframe python: https://stackoverflow.com/a/51540521/4084039
    tempData = pd.DataFrame(groupedData[col2].agg(lambda x: x.eq(1).sum())).reset_index();
    tempData['total'] = groupedData[col2].count().values;
    tempData['approval_rate'] = groupedData[col2].mean().values;
tempData.sort_values(by=['total'], inplace = True, ascending = False);
tempDataWithTotalAndCol2 = tempData[['total', col2, col1]]
if plot:
if(orientation == 'vertical'):
tempDataWithTotalAndCol2.plot(x = col1, align= 'center', kind = 'bar', title = "Number of projects approved vs rejected", figsize = (20, 6), stacked = True, rot = 0);
else:
tempDataWithTotalAndCol2.plot(x = col1, align= 'center', kind = 'barh', title = "Number of projects approved vs rejected", width = 0.8, figsize = (23, 20), stacked = True);
return tempData;
statesCharacteristicsData = univariateBarPlots(projectsData, 'school_state', 'project_is_approved', orientation = 'vertical');
printStyle("5 states with the most project proposals", color.BOLD)
equalsBorder(60);
statesCharacteristicsData.head(5)
printStyle("5 states with the fewest project proposals", color.BOLD)
equalsBorder(60);
statesCharacteristicsData.tail(5)
teacherPrefixCharacteristicsData = univariateBarPlots(projectsData, 'teacher_prefix', 'project_is_approved', orientation = 'vertical', plot = True);
printStyle("Project proposal characteristics by teacher prefix", color.BOLD);
equalsBorder(60);
teacherPrefixCharacteristicsData
gradeCharacteristicsData = univariateBarPlots(projectsData, 'project_grade_category', 'project_is_approved', orientation = 'vertical', plot = True);
printStyle("Project proposal characteristics based on grades", color.BOLD);
equalsBorder(60);
gradeCharacteristicsData
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039
# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
def cleanCategories(subjectCategories):
    cleanedCategories = []
    for subjectCategory in tqdm(subjectCategories):
        tempCategory = ""
        for category in subjectCategory.split(","):
            if 'The' in category.split(): # split the category on spaces, e.g. "Math & Science" => "Math", "&", "Science"
                category = category.replace('The', '') # remove the word 'The' from the category
            category = category.replace(' ', '') # remove all spaces, e.g. "Math & Science" => "Math&Science"
            tempCategory += category.strip() + " " # strip() removes leading/trailing spaces, e.g. " abc ".strip() => "abc"
        tempCategory = tempCategory.replace('&', '_')
        cleanedCategories.append(tempCategory)
    return cleanedCategories
# projectDataWithCleanedCategories = pd.DataFrame(projectsData);
subjectCategories = list(projectsData.project_subject_categories);
cleanedCategories = cleanCategories(subjectCategories);
printStyle("Sample categories: ", color.BOLD);
equalsBorder(60);
print(subjectCategories[0:5]);
equalsBorder(60);
printStyle("Sample cleaned categories: ", color.BOLD);
equalsBorder(60);
print(cleanedCategories[0:5]);
projectsData['cleaned_categories'] = cleanedCategories;
projectsData.head(5)
categoriesCharacteristicsData = univariateBarPlots(projectsData, 'cleaned_categories', 'project_is_approved', orientation = 'horizontal', plot = True);
print("Project proposal characteristics based on subject categories");
equalsBorder(60);
categoriesCharacteristicsData.head(5)
# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
categoriesCounter = Counter()
for subjectCategory in projectsData.cleaned_categories.values:
categoriesCounter.update(subjectCategory.split());
categoriesCounter
# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
categoriesDictionary = dict(categoriesCounter);
sortedCategoriesDictionary = dict(sorted(categoriesDictionary.items(), key = lambda keyValue: keyValue[1]));
sortedCategoriesData = pd.DataFrame.from_dict(sortedCategoriesDictionary, columns = ['number_of_projects'], orient = 'index');
printStyle("Number of projects by Subject Categories: ", color.BOLD);
equalsBorder(60);
sortedCategoriesData
sortedCategoriesData.plot(kind = 'bar', title = 'Number of projects by subject categories');
subjectSubCategories = projectsData.project_subject_subcategories;
cleanedSubCategories = cleanCategories(subjectSubCategories);
printStyle("Sample subject sub categories: ", color.BOLD);
equalsBorder(70);
print(subjectSubCategories[0:5]);
equalsBorder(70);
printStyle("Sample cleaned subject sub categories: ", color.BOLD);
equalsBorder(70);
print(cleanedSubCategories[0:5]);
projectsData['cleaned_sub_categories'] = cleanedSubCategories;
projectsData.head(5)
subCategoriesCharacteristicsData = univariateBarPlots(projectsData, 'cleaned_sub_categories', 'project_is_approved', plot = False);
print("Project proposal characteristics based on subject sub categories");
equalsBorder(60);
subCategoriesCharacteristicsData.head(5)
# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
subjectsSubCategoriesCounter = Counter();
for subCategory in projectsData.cleaned_sub_categories:
subjectsSubCategoriesCounter.update(subCategory.split());
subjectsSubCategoriesCounter
# dict sort by value python: https://stackoverflow.com/a/613218/4084039
dictionarySubCategories = dict(subjectsSubCategoriesCounter);
sortedDictionarySubCategories = dict(sorted(dictionarySubCategories.items(), key = lambda keyValue: keyValue[1]));
sortedSubCategoriesData = pd.DataFrame.from_dict(sortedDictionarySubCategories, columns = ['number_of_projects'], orient = 'index');
sortedSubCategoriesData.plot(kind = 'bar', title = "Number of projects by subject sub categories");
printStyle("Number of projects sorted by subject sub categories: ", color.BOLD);
equalsBorder(70);
sortedSubCategoriesData
#How to calculate number of words in a string in DataFrame: https://stackoverflow.com/a/37483537/4084039
wordCounts = projectsData['project_title'].str.split().apply(len).value_counts();
dictionaryWordCounts = dict(wordCounts);
dictionaryWordCounts = dict(sorted(dictionaryWordCounts.items(), key = lambda kv: kv[1]));
wordCountsData = pd.DataFrame.from_dict({'number_of_words': list(dictionaryWordCounts.keys()), 'number_of_projects': list(dictionaryWordCounts.values())}).sort_values(by = ['number_of_projects']);
wordCountsData.plot(kind = 'bar', title = "Number of projects vs Number of words in project title", legend = False);
plt.xlabel('Number of words');
plt.ylabel('Number of projects');
wordCountsData
approvedNumberOfProjects = projectsData[projectsData.project_is_approved == 1]['project_title'].str.split().apply(len);
approvedNumberOfProjects = approvedNumberOfProjects.values
unApprovedNumberOfProjects = projectsData[projectsData.project_is_approved == 0]['project_title'].str.split().apply(len);
unApprovedNumberOfProjects = unApprovedNumberOfProjects.values
plt.boxplot([approvedNumberOfProjects, unApprovedNumberOfProjects]);
plt.grid();
plt.xticks([1, 2], ['Approved Projects', 'UnApproved Projects']);
plt.ylabel('Number of words in title');
plt.show();
plt.figure(figsize = (10, 6));
sbrn.kdeplot(approvedNumberOfProjects, label = "Approved Projects", bw = 0.6);
sbrn.kdeplot(unApprovedNumberOfProjects, label = "UnApproved Projects", bw = 0.6);
plt.legend();
plt.show();
# Join the four essay fields with spaces (so boundary words do not fuse) and treat missing essays as empty strings
projectsData['project_essay'] = projectsData['project_essay_1'].fillna('').map(str) + " " + projectsData['project_essay_2'].fillna('').map(str) + " " + \
                                projectsData['project_essay_3'].fillna('').map(str) + " " + projectsData['project_essay_4'].fillna('').map(str);
projectsData.head(5)
approvedNumberOfProjects = projectsData[projectsData.project_is_approved == 1]['project_essay'].str.split().apply(len);
approvedNumberOfProjects = approvedNumberOfProjects.values
unApprovedNumberOfProjects = projectsData[projectsData.project_is_approved == 0]['project_essay'].str.split().apply(len);
unApprovedNumberOfProjects = unApprovedNumberOfProjects.values
plt.boxplot([approvedNumberOfProjects, unApprovedNumberOfProjects]);
plt.grid();
plt.xticks([1, 2], ['Approved Projects', 'UnApproved Projects']);
plt.ylabel('Number of words in project essay');
plt.show();
plt.figure(figsize = (10, 6));
sbrn.kdeplot(approvedNumberOfProjects, label = "Approved Projects", bw = 5);
sbrn.kdeplot(unApprovedNumberOfProjects, label = "UnApproved Projects", bw = 5);
plt.legend();
plt.show();
projectsData.head(5)
resourcesData.head(5)
# https://stackoverflow.com/questions/22407798/how-to-reset-a-dataframes-indexes-for-all-groups-in-one-step
priceAndQuantityData = resourcesData.groupby('id').agg({'price': 'sum', 'quantity': 'sum'}).reset_index();
priceAndQuantityData.head(5)
projectsData.shape
projectsData = pd.merge(projectsData, priceAndQuantityData, on = 'id', how = 'left');
print(projectsData.shape);
projectsData.head(3)
projectsData[projectsData['id'] == 'p253737']
priceAndQuantityData[priceAndQuantityData['id'] == 'p253737']
approvedProjectsPrice = projectsData[projectsData['project_is_approved'] == 1].price;
unApprovedProjectsPrice = projectsData[projectsData['project_is_approved'] == 0].price;
plt.boxplot([approvedProjectsPrice, unApprovedProjectsPrice]);
plt.grid();
plt.xticks([1, 2], ['Approved Projects', 'UnApproved Projects']);
plt.ylabel('Cost per project');
plt.show();
plt.title("Kde plot based on cost per project");
sbrn.kdeplot(approvedProjectsPrice, label = "Approved Projects", bw = 0.6);
sbrn.kdeplot(unApprovedProjectsPrice, label = "UnApproved Projects", bw = 0.6);
plt.legend();
plt.show();
pricePercentilesApproved = [round(np.percentile(approvedProjectsPrice, percentile), 3) for percentile in np.arange(0, 100, 5)];
pricePercentilesUnApproved = [round(np.percentile(unApprovedProjectsPrice, percentile), 3) for percentile in np.arange(0, 100, 5)];
percentileValuePricesData = pd.DataFrame({'Percentile': np.arange(0, 100, 5), 'Approved projects': pricePercentilesApproved, 'UnApproved Projects': pricePercentilesUnApproved});
percentileValuePricesData
approvedProjectsQuantity = projectsData[projectsData['project_is_approved'] == 1].quantity;
unApprovedProjectsQuantity = projectsData[projectsData['project_is_approved'] == 0].quantity;
plt.boxplot([approvedProjectsQuantity, unApprovedProjectsQuantity]);
plt.grid();
plt.xticks([1, 2], ['Approved Projects', 'UnApproved Projects']);
plt.ylabel('Quantity of resources per project');
plt.show();
plt.title("Kde plot based on quantity of resources per project");
sbrn.kdeplot(approvedProjectsQuantity, label = "Approved Projects", bw = 0.6);
sbrn.kdeplot(unApprovedProjectsQuantity, label = "UnApproved Projects", bw = 0.6);
plt.legend();
plt.show();
quantityPercentilesApproved = [round(np.percentile(approvedProjectsQuantity, percentile), 3) for percentile in np.arange(0, 100, 5)];
quantityPercentilesUnApproved = [round(np.percentile(unApprovedProjectsQuantity, percentile), 3) for percentile in np.arange(0, 100, 5)];
percentileValueQuantitiesData = pd.DataFrame({'Percentile': np.arange(0, 100, 5), 'Approved projects': quantityPercentilesApproved, 'UnApproved Projects': quantityPercentilesUnApproved});
percentileValueQuantitiesData
sbrn.set_style('whitegrid');
sbrn.FacetGrid(projectsData, hue = 'project_is_approved', height = 6) \
.map(plt.scatter, 'price', 'quantity') \
.add_legend();
plt.title("Scatter plot between price and quantity based project approval and rejection");
plt.show();
projectsData.head(5)
previouslyPostedApprovedNumberData = projectsData.groupby('teacher_number_of_previously_posted_projects')['project_is_approved'].agg(lambda x: x.eq(1).sum()).reset_index();
previouslyPostedRejectedNumberData = projectsData.groupby('teacher_number_of_previously_posted_projects')['project_is_approved'].agg(lambda x: x.eq(0).sum()).reset_index();
print("Total number of projects approved: ", len(projectsData[projectsData['project_is_approved'] == 1]));
print("Total number of projects rejected: ", len(projectsData[projectsData['project_is_approved'] == 0]));
print("Number of projects approved categorized by previously_posted: ", previouslyPostedApprovedNumberData['project_is_approved'].sum());
print("Number of projects rejected categorized by previously_posted: ", previouslyPostedRejectedNumberData['project_is_approved'].sum());
previouslyPostedNumberData = pd.merge(previouslyPostedApprovedNumberData, previouslyPostedRejectedNumberData, on = 'teacher_number_of_previously_posted_projects', how = 'inner');
previouslyPostedNumberData.head(5)
plt.figure(figsize = (20, 8));
plt.bar(previouslyPostedNumberData.teacher_number_of_previously_posted_projects, previouslyPostedNumberData.project_is_approved_x);
plt.bar(previouslyPostedNumberData.teacher_number_of_previously_posted_projects, previouslyPostedNumberData.project_is_approved_y);
plt.show();
previouslyPostedApprovedData = projectsData[projectsData['project_is_approved'] == 1].teacher_number_of_previously_posted_projects;
previouslyPostedRejectedData = projectsData[projectsData['project_is_approved'] == 0].teacher_number_of_previously_posted_projects;
plt.boxplot([previouslyPostedApprovedData, previouslyPostedRejectedData]);
plt.grid();
plt.xticks([1, 2], ['Approved Projects', 'Rejected Projects']);
plt.ylabel('Previously posted number of projects');
plt.show();
sbrn.kdeplot(previouslyPostedApprovedData, label = "Approved projects", bw = 1);
sbrn.kdeplot(previouslyPostedRejectedData, label = "Rejected projects", bw = 1);
plt.show();
def stringContainsNumbers(string):
    return any(character.isdigit() for character in string)
numericResourceApprovedData = projectsData[(projectsData['project_resource_summary'].apply(stringContainsNumbers) == True) & (projectsData['project_is_approved'] == 1)]
textResourceApprovedData = projectsData[(projectsData['project_resource_summary'].apply(stringContainsNumbers) == False) & (projectsData['project_is_approved'] == 1)]
numericResourceRejectedData = projectsData[(projectsData['project_resource_summary'].apply(stringContainsNumbers) == True) & (projectsData['project_is_approved'] == 0)]
textResourceRejectedData = projectsData[(projectsData['project_resource_summary'].apply(stringContainsNumbers) == False) & (projectsData['project_is_approved'] == 0)]
print("Does the presence of numbers in the resource summary relate to project approval?");
equalsBorder(70);
print("Number of approved projects with numbers in resource summary: ", numericResourceApprovedData.shape[0]);
print("Number of rejected projects with numbers in resource summary: ", numericResourceRejectedData.shape[0]);
print("Number of approved projects without numbers in resource summary: ", textResourceApprovedData.shape[0]);
print("Number of rejected projects without numbers in resource summary: ", textResourceRejectedData.shape[0]);
# https://gist.github.com/sebleier/554280
# Stop words to be removed from the text. Note: the negation words 'no', 'nor'
# and 'not' are deliberately left out of this list so that negations survive pre-processing.
stopWords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
'won', "won't", 'wouldn', "wouldn't"]);
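Because negations are kept out of the stop-word list, words like "not" survive the filtering step and remain available as signal. A quick check (using a small illustrative subset of the list, not the full set above):

```python
# Tiny illustrative subset of the stop-word list; 'not' is deliberately absent
stop_subset = {'the', 'is', 'a', 'of', 'for'}

def remove_stop_words(text, stop_words):
    # Keep only the words that are not in the stop-word set
    return ' '.join(word for word in text.split() if word not in stop_words)

print(remove_stop_words("the project is not for the school", stop_subset))
# -> "project not school"
```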
def preProcessingWithAndWithoutStopWords(texts):
"""
This function takes list of texts and returns preprocessed list of texts one with
stop words and one without stopwords.
"""
# Variable for storing preprocessed text with stop words
preProcessedTextsWithStopWords = [];
# Variable for storing preprocessed text without stop words
preProcessedTextsWithoutStopWords = [];
# Looping over list of texts for performing pre processing
for text in tqdm(texts, total = len(texts)):
# Removing all links in the text
text = re.sub(r"http\S+", "", text);
        # Removing all HTML tags (opening, closing and self-closing) in the text
        text = re.sub(r"</?\w+[^>]*>", "", text);
# https://stackoverflow.com/a/47091490/4084039
        # Expanding common English contractions
text = re.sub(r"won't", "will not", text)
text = re.sub(r"can\'t", "can not", text)
text = re.sub(r"n\'t", " not", text)
text = re.sub(r"\'re", " are", text)
text = re.sub(r"\'s", " is", text)
text = re.sub(r"\'d", " would", text)
text = re.sub(r"\'ll", " will", text)
text = re.sub(r"\'t", " not", text)
text = re.sub(r"\'ve", " have", text)
text = re.sub(r"\'m", " am", text)
# Removing backslash symbols in text
text = text.replace('\\r', ' ');
text = text.replace('\\n', ' ');
text = text.replace('\\"', ' ');
# Removing all special characters of text
text = re.sub(r"[^a-zA-Z0-9]+", " ", text);
# Converting whole review text into lower case
text = text.lower();
        # adding the preprocessed text (stop words retained) to the list
        preProcessedTextsWithStopWords.append(text);
# removing stop words from text
textWithoutStopWords = ' '.join([word for word in text.split() if word not in stopWords]);
# adding this preprocessed text without stopwords to list
preProcessedTextsWithoutStopWords.append(textWithoutStopWords);
return [preProcessedTextsWithStopWords, preProcessedTextsWithoutStopWords];
texts = [projectsData['project_essay'].values[0]]
preProcessedTextsWithStopWords, preProcessedTextsWithoutStopWords = preProcessingWithAndWithoutStopWords(texts);
print("Example project essay without pre-processing: ");
equalsBorder(70);
print(texts);
equalsBorder(70);
print("Example project essay with stop words and pre-processing: ");
equalsBorder(70);
print(preProcessedTextsWithStopWords);
equalsBorder(70);
print("Example project essay without stop words and pre-processing: ");
equalsBorder(70);
print(preProcessedTextsWithoutStopWords);
projectEssays = projectsData['project_essay'];
preProcessedEssaysWithStopWords, preProcessedEssaysWithoutStopWords = preProcessingWithAndWithoutStopWords(projectEssays);
preProcessedEssaysWithoutStopWords[0:3]
projectTitles = projectsData['project_title'];
preProcessedProjectTitlesWithStopWords, preProcessedProjectTitlesWithoutStopWords = preProcessingWithAndWithoutStopWords(projectTitles);
preProcessedProjectTitlesWithoutStopWords[0:5]
pd.DataFrame(projectsData.columns, columns = ['All features in projects data'])
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique cleaned_categories
subjectsCategoriesVectorizer = CountVectorizer(vocabulary = list(sortedCategoriesDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with cleaned_categories values
subjectsCategoriesVectorizer.fit(projectsData['cleaned_categories'].values);
# Vectorizing categories using one-hot-encoding
categoriesVectors = subjectsCategoriesVectorizer.transform(projectsData['cleaned_categories'].values);
print("Features used in vectorizing categories: ");
equalsBorder(70);
print(subjectsCategoriesVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of cleaned_categories matrix after vectorization(one-hot-encoding): ", categoriesVectors.shape);
equalsBorder(70);
print("Sample vectors of categories: ");
equalsBorder(70);
print(categoriesVectors[0:4])
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique cleaned_sub_categories
subjectsSubCategoriesVectorizer = CountVectorizer(vocabulary = list(sortedDictionarySubCategories.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with cleaned_sub_categories values
subjectsSubCategoriesVectorizer.fit(projectsData['cleaned_sub_categories'].values);
# Vectorizing sub categories using one-hot-encoding
subCategoriesVectors = subjectsSubCategoriesVectorizer.transform(projectsData['cleaned_sub_categories'].values);
print("Features used in vectorizing subject sub categories: ");
equalsBorder(70);
print(subjectsSubCategoriesVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of cleaned_sub_categories matrix after vectorization (one-hot encoding): ", subCategoriesVectors.shape);
equalsBorder(70);
print("Sample vectors of sub categories: ");
equalsBorder(70);
print(subCategoriesVectors[0:4])
def giveCounter(data):
counter = Counter();
for dataValue in data:
counter.update(str(dataValue).split());
return counter
giveCounter(projectsData['teacher_prefix'].values)
projectsData = projectsData.dropna(subset = ['teacher_prefix']);
projectsData.shape
teacherPrefixDictionary = dict(giveCounter(projectsData['teacher_prefix'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique teacher_prefix
teacherPrefixVectorizer = CountVectorizer(vocabulary = list(teacherPrefixDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with teacher_prefix values
teacherPrefixVectorizer.fit(projectsData['teacher_prefix'].values);
# Vectorizing teacher_prefix using one-hot-encoding
teacherPrefixVectors = teacherPrefixVectorizer.transform(projectsData['teacher_prefix'].values);
print("Features used in vectorizing teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of teacher_prefix matrix after vectorization(one-hot-encoding): ", teacherPrefixVectors.shape);
equalsBorder(70);
print("Sample vectors of teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectors[0:100]);
teacherPrefixes = [prefix.replace('.', '') for prefix in projectsData['teacher_prefix'].values];
teacherPrefixes[0:5]
projectsData['teacher_prefix'] = teacherPrefixes;
projectsData.head(3)
teacherPrefixDictionary = dict(giveCounter(projectsData['teacher_prefix'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique teacher_prefix
teacherPrefixVectorizer = CountVectorizer(vocabulary = list(teacherPrefixDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with teacher_prefix values
teacherPrefixVectorizer.fit(projectsData['teacher_prefix'].values);
# Vectorizing teacher_prefix using one-hot-encoding
teacherPrefixVectors = teacherPrefixVectorizer.transform(projectsData['teacher_prefix'].values);
print("Features used in vectorizing teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of teacher_prefix matrix after vectorization(one-hot-encoding): ", teacherPrefixVectors.shape);
equalsBorder(70);
print("Sample vectors of teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectors[0:4]);
schoolStateDictionary = dict(giveCounter(projectsData['school_state'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique school states
schoolStateVectorizer = CountVectorizer(vocabulary = list(schoolStateDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with school_state values
schoolStateVectorizer.fit(projectsData['school_state'].values);
# Vectorizing school_state using one-hot-encoding
schoolStateVectors = schoolStateVectorizer.transform(projectsData['school_state'].values);
print("Features used in vectorizing school_state: ");
equalsBorder(70);
print(schoolStateVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of school_state matrix after vectorization(one-hot-encoding): ", schoolStateVectors.shape);
equalsBorder(70);
print("Sample vectors of school_state: ");
equalsBorder(70);
print(schoolStateVectors[0:4]);
giveCounter(projectsData['project_grade_category'])
cleanedGrades = []
for grade in projectsData['project_grade_category'].values:
    grade = grade.replace(' ', '');
    grade = grade.replace('-', 'to');
    cleanedGrades.append(grade);
cleanedGrades[0:4]
projectsData['project_grade_category'] = cleanedGrades
projectsData.head(4)
projectGradeDictionary = dict(giveCounter(projectsData['project_grade_category'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique project grade categories
projectGradeVectorizer = CountVectorizer(vocabulary = list(projectGradeDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with project_grade_category values
projectGradeVectorizer.fit(projectsData['project_grade_category'].values);
# Vectorizing project_grade_category using one-hot-encoding
projectGradeVectors = projectGradeVectorizer.transform(projectsData['project_grade_category'].values);
print("Features used in vectorizing project_grade_category: ");
equalsBorder(70);
print(projectGradeVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of school_state matrix after vectorization(one-hot-encoding): ", projectGradeVectors.shape);
equalsBorder(70);
print("Sample vectors of school_state: ");
equalsBorder(70);
print(projectGradeVectors[0:4]);
projectsDataSub = projectsData[0:40000];
preProcessedEssaysWithoutStopWordsSub = preProcessedEssaysWithoutStopWords[0:40000];
preProcessedProjectTitlesWithoutStopWordsSub = preProcessedProjectTitlesWithoutStopWords[0:40000];
# Initializing countvectorizer for bag of words vectorization of preprocessed project essays
bowEssayVectorizer = CountVectorizer(min_df = 10);
# Transforming the preprocessed essays to bag of words vectors
bowEssayModel = bowEssayVectorizer.fit_transform(preProcessedEssaysWithoutStopWordsSub);
print("Some of the Features used in vectorizing preprocessed essays: ");
equalsBorder(70);
print(bowEssayVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed essay matrix after vectorization: ", bowEssayModel.shape);
equalsBorder(70);
print("Sample bag-of-words vector of preprocessed essay: ");
equalsBorder(70);
print(bowEssayModel[0])
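`min_df = 10` tells the vectorizer to ignore any word that appears in fewer than 10 documents, which prunes rare and misspelled terms. A small sketch on a hypothetical corpus with `min_df = 2`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical): min_df = 2 drops terms appearing in fewer
# than 2 documents ('tablets' and 'love' are pruned)
corpus = ["students need books", "students need tablets", "students love books"]
vectorizer = CountVectorizer(min_df = 2)
bowModel = vectorizer.fit_transform(corpus)
print(sorted(vectorizer.vocabulary_.keys()))  # ['books', 'need', 'students']
```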
# Initializing countvectorizer for bag of words vectorization of preprocessed project titles
bowTitleVectorizer = CountVectorizer(min_df = 10);
# Transforming the preprocessed project titles to bag of words vectors
bowTitleModel = bowTitleVectorizer.fit_transform(preProcessedProjectTitlesWithoutStopWordsSub);
print("Some of the Features used in vectorizing preprocessed titles: ");
equalsBorder(70);
print(bowTitleVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after vectorization: ", bowTitleModel.shape);
equalsBorder(70);
print("Sample bag-of-words vector of preprocessed title: ");
equalsBorder(70);
print(bowTitleModel[0])
# Intializing tfidf vectorizer for tf-idf vectorization of preprocessed project essays
tfIdfEssayVectorizer = TfidfVectorizer(min_df = 10);
# Transforming the preprocessed project essays to tf-idf vectors
tfIdfEssayModel = tfIdfEssayVectorizer.fit_transform(preProcessedEssaysWithoutStopWordsSub);
print("Some of the Features used in tf-idf vectorizing preprocessed essays: ");
equalsBorder(70);
print(tfIdfEssayVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after tf-idf vectorization: ", tfIdfEssayModel.shape);
equalsBorder(70);
print("Sample Tf-Idf vector of preprocessed essay: ");
equalsBorder(70);
print(tfIdfEssayModel[0])
# Intializing tfidf vectorizer for tf-idf vectorization of preprocessed project titles
tfIdfTitleVectorizer = TfidfVectorizer(min_df = 10);
# Transforming the preprocessed project titles to tf-idf vectors
tfIdfTitleModel = tfIdfTitleVectorizer.fit_transform(preProcessedProjectTitlesWithoutStopWordsSub);
print("Some of the Features used in tf-idf vectorizing preprocessed titles: ");
equalsBorder(70);
print(tfIdfTitleVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after tf-idf vectorization: ", tfIdfTitleModel.shape);
equalsBorder(70);
print("Sample Tf-Idf vector of preprocessed title: ");
equalsBorder(70);
print(tfIdfTitleModel[0])
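Unlike bag-of-words, tf-idf down-weights words that occur in many documents. scikit-learn's default (smoothed) inverse document frequency is `idf(t) = ln((1 + n) / (1 + df(t))) + 1`; a hand computation on toy documents:

```python
import numpy as np

# Toy documents (hypothetical) for computing sklearn-style smoothed idf
docs = [["students", "need", "books"], ["students", "need", "tablets"]]
n = len(docs)
vocab = sorted({word for doc in docs for word in doc})
df = {word: sum(word in doc for doc in docs) for word in vocab}
idf = {word: np.log((1 + n) / (1 + df[word])) + 1 for word in vocab}
print(round(idf["students"], 3))  # 1.0   (appears in every document)
print(round(idf["books"], 3))     # 1.405 (appears in one of two documents)
```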
# Storing variables into pickle files in Python: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/
# The glove_vectors pickle file must be present to build the model below
with open('glove_vectors', 'rb') as f:
    gloveModel = pickle.load(f)
gloveWords = set(gloveModel.keys())
print("Glove vector of sample word: ");
equalsBorder(70);
print(gloveModel['technology']);
equalsBorder(70);
print("Shape of glove vector: ", gloveModel['technology'].shape);
def getWord2VecVectors(texts):
    word2VecTextsVectors = [];
    for preProcessedText in tqdm(texts):
        word2VecTextVector = np.zeros(300);
        numberOfWordsInText = 0;
        for word in preProcessedText.split():
            if word in gloveWords:
                word2VecTextVector += gloveModel[word];
                numberOfWordsInText += 1;
        if numberOfWordsInText != 0:
            word2VecTextVector = word2VecTextVector / numberOfWordsInText;
        word2VecTextsVectors.append(word2VecTextVector);
    return word2VecTextsVectors;
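The function above averages the GloVe vectors of the words present in the embedding; words without a vector are skipped. A self-contained sketch with a toy 3-dimensional embedding (the real pickle maps words to 300-dimensional vectors):

```python
import numpy as np

# Toy 3-d "glove" dictionary (hypothetical values)
gloveModelToy = {"students": np.array([1.0, 0.0, 2.0]),
                 "books": np.array([3.0, 2.0, 0.0])}

def averageWord2Vec(text, dimensions = 3):
    vector = np.zeros(dimensions)
    wordsFound = 0
    for word in text.split():
        if word in gloveModelToy:
            vector += gloveModelToy[word]
            wordsFound += 1
    # average only over words that had an embedding ('love' is skipped below)
    return vector / wordsFound if wordsFound else vector

print(averageWord2Vec("students love books"))  # [2. 1. 1.]
```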
word2VecEssaysVectors = getWord2VecVectors(preProcessedEssaysWithoutStopWords);
print("Shape of Word2Vec vectorization matrix of essays: {},{}".format(len(word2VecEssaysVectors), len(word2VecEssaysVectors[0])));
equalsBorder(70);
print("Sample essay: ");
equalsBorder(70);
print(preProcessedEssaysWithoutStopWords[0]);
equalsBorder(70);
print("Word2Vec vector of sample essay: ");
equalsBorder(70);
print(word2VecEssaysVectors[0]);
word2VecTitlesVectors = getWord2VecVectors(preProcessedProjectTitlesWithoutStopWords);
print("Shape of Word2Vec vectorization matrix of project titles: {}, {}".format(len(word2VecTitlesVectors), len(word2VecTitlesVectors[0])));
equalsBorder(70);
print("Sample title: ");
equalsBorder(70);
print(preProcessedProjectTitlesWithoutStopWords[0]);
equalsBorder(70);
print("Word2Vec vector of sample title: ");
equalsBorder(70);
print(word2VecTitlesVectors[0]);
# Initializing tfidf vectorizer
tfIdfEssayTempVectorizer = TfidfVectorizer();
# Vectorizing preprocessed essays using tfidf vectorizer initialized above
tfIdfEssayTempVectorizer.fit(preProcessedEssaysWithoutStopWords);
# Saving dictionary in which each word is key and it's idf is value
tfIdfEssayDictionary = dict(zip(tfIdfEssayTempVectorizer.get_feature_names(), list(tfIdfEssayTempVectorizer.idf_)));
# Creating set of all unique words used by tfidf vectorizer
tfIdfEssayWords = set(tfIdfEssayTempVectorizer.get_feature_names());
# Creating list to save tf-idf weighted vectors of essays
tfIdfWeightedWord2VecEssaysVectors = [];
# Iterating over each essay
for essay in tqdm(preProcessedEssaysWithoutStopWords):
    # Sum of tf-idf values of all words in a particular essay
    cumulativeSumTfIdfWeightOfEssay = 0;
    # Tf-Idf weighted word2vec vector of a particular essay
    tfIdfWeightedWord2VecEssayVector = np.zeros(300);
    # Splitting essay into list of words
    splittedEssay = essay.split();
    # Iterating over each word
    for word in splittedEssay:
        # Checking if word is in glove words and in the set of words used by the tf-idf essay vectorizer
        if (word in gloveWords) and (word in tfIdfEssayWords):
            # Tf-idf value of the word in this essay (counting whole words, not substrings)
            tfIdfValueWord = tfIdfEssayDictionary[word] * (splittedEssay.count(word) / len(splittedEssay));
            # Adding the tf-idf weighted word vector
            tfIdfWeightedWord2VecEssayVector += tfIdfValueWord * gloveModel[word];
            # Adding the word's tf-idf weight to the cumulative sum
            cumulativeSumTfIdfWeightOfEssay += tfIdfValueWord;
    if cumulativeSumTfIdfWeightOfEssay != 0:
        # Dividing the weighted sum of vectors by the cumulative tf-idf weight
        tfIdfWeightedWord2VecEssayVector = tfIdfWeightedWord2VecEssayVector / cumulativeSumTfIdfWeightOfEssay;
    # Appending the tf-idf weighted vector of this essay to the list of essay vectors
    tfIdfWeightedWord2VecEssaysVectors.append(tfIdfWeightedWord2VecEssayVector);
print("Shape of Tf-Idf weighted Word2Vec vectorization matrix of project essays: {}, {}".format(len(tfIdfWeightedWord2VecEssaysVectors), len(tfIdfWeightedWord2VecEssaysVectors[0])));
equalsBorder(70);
print("Sample Essay: ");
equalsBorder(70);
print(preProcessedEssaysWithoutStopWords[0]);
equalsBorder(70);
print("Tf-Idf Weighted Word2Vec vector of sample essay: ");
equalsBorder(70);
print(tfIdfWeightedWord2VecEssaysVectors[0]);
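The loop above computes, per document, the weighted average `v = sum(tfidf(w) * glove(w)) / sum(tfidf(w))`, so high-tf-idf words dominate the document vector. A toy check of that formula (hypothetical 2-d vectors and weights):

```python
import numpy as np

# Toy embedding and tf-idf weights (hypothetical values)
gloveToy = {"students": np.array([1.0, 0.0]), "books": np.array([0.0, 2.0])}
tfIdfToy = {"students": 0.5, "books": 1.5}

# weighted sum of word vectors, divided by the total tf-idf weight
weightedSum = sum(tfIdfToy[w] * gloveToy[w] for w in tfIdfToy)
vector = weightedSum / sum(tfIdfToy.values())
print(vector)
```

Here "books" carries three times the weight of "students", so the result `[0.25, 1.5]` is pulled toward the "books" vector.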
# Initializing tfidf vectorizer
tfIdfTitleTempVectorizer = TfidfVectorizer();
# Vectorizing preprocessed titles using tfidf vectorizer initialized above
tfIdfTitleTempVectorizer.fit(preProcessedProjectTitlesWithoutStopWords);
# Saving dictionary in which each word is key and it's idf is value
tfIdfTitleDictionary = dict(zip(tfIdfTitleTempVectorizer.get_feature_names(), list(tfIdfTitleTempVectorizer.idf_)));
# Creating set of all unique words used by tfidf vectorizer
tfIdfTitleWords = set(tfIdfTitleTempVectorizer.get_feature_names());
# Creating list to save tf-idf weighted vectors of project titles
tfIdfWeightedWord2VecTitlesVectors = [];
# Iterating over each title
for title in tqdm(preProcessedProjectTitlesWithoutStopWords):
    # Sum of tf-idf values of all words in a particular project title
    cumulativeSumTfIdfWeightOfTitle = 0;
    # Tf-Idf weighted word2vec vector of a particular project title
    tfIdfWeightedWord2VecTitleVector = np.zeros(300);
    # Splitting title into list of words
    splittedTitle = title.split();
    # Iterating over each word
    for word in splittedTitle:
        # Checking if word is in glove words and in the set of words used by the tf-idf title vectorizer
        if (word in gloveWords) and (word in tfIdfTitleWords):
            # Tf-idf value of the word in this title (counting whole words, not substrings)
            tfIdfValueWord = tfIdfTitleDictionary[word] * (splittedTitle.count(word) / len(splittedTitle));
            # Adding the tf-idf weighted word vector
            tfIdfWeightedWord2VecTitleVector += tfIdfValueWord * gloveModel[word];
            # Adding the word's tf-idf weight to the cumulative sum
            cumulativeSumTfIdfWeightOfTitle += tfIdfValueWord;
    if cumulativeSumTfIdfWeightOfTitle != 0:
        # Dividing the weighted sum of vectors by the cumulative tf-idf weight
        tfIdfWeightedWord2VecTitleVector = tfIdfWeightedWord2VecTitleVector / cumulativeSumTfIdfWeightOfTitle;
    # Appending the tf-idf weighted vector of this title to the list of title vectors
    tfIdfWeightedWord2VecTitlesVectors.append(tfIdfWeightedWord2VecTitleVector);
print("Shape of Tf-Idf weighted Word2Vec vectorization matrix of project titles: {}, {}".format(len(tfIdfWeightedWord2VecTitlesVectors), len(tfIdfWeightedWord2VecTitlesVectors[0])));
equalsBorder(70);
print("Sample Title: ");
equalsBorder(70);
print(preProcessedProjectTitlesWithoutStopWords[0]);
equalsBorder(70);
print("Tf-Idf Weighted Word2Vec vector of sample title: ");
equalsBorder(70);
print(tfIdfWeightedWord2VecTitlesVectors[0]);
# Standardizing the price data using StandardScaler(Uses mean and std for standardization)
priceScaler = StandardScaler();
priceScaler.fit(projectsData['price'].values.reshape(-1, 1));
priceStandardized = priceScaler.transform(projectsData['price'].values.reshape(-1, 1));
print("Shape of standardized matrix of prices: ", priceStandardized.shape);
equalsBorder(70);
print("Sample original prices: ");
equalsBorder(70);
print(projectsData['price'].values[0:5]);
print("Sample standardized prices: ");
equalsBorder(70);
print(priceStandardized[0:5]);
# Standardizing the quantity data using StandardScaler(Uses mean and std for standardization)
quantityScaler = StandardScaler();
quantityScaler.fit(projectsData['quantity'].values.reshape(-1, 1));
quantityStandardized = quantityScaler.transform(projectsData['quantity'].values.reshape(-1, 1));
print("Shape of standardized matrix of quantities: ", quantityStandardized.shape);
equalsBorder(70);
print("Sample original quantities: ");
equalsBorder(70);
print(projectsData['quantity'].values[0:5]);
print("Sample standardized quantities: ");
equalsBorder(70);
print(quantityStandardized[0:5]);
# Standardizing the teacher_number_of_previously_posted_projects data using StandardScaler(Uses mean and std for standardization)
previouslyPostedScaler = StandardScaler();
previouslyPostedScaler.fit(projectsData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
previouslyPostedStandardized = previouslyPostedScaler.transform(projectsData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
print("Shape of standardized matrix of teacher_number_of_previously_posted_projects: ", previouslyPostedStandardized.shape);
equalsBorder(70);
print("Sample original quantities: ");
equalsBorder(70);
print(projectsData['teacher_number_of_previously_posted_projects'].values[0:5]);
print("Sample standardized teacher_number_of_previously_posted_projects: ");
equalsBorder(70);
print(previouslyPostedStandardized[0:5]);
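All three numeric columns use the same transformation: `StandardScaler` subtracts the column mean and divides by the (population) standard deviation, giving zero-mean, unit-variance features. A quick sanity check on toy prices:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy prices (hypothetical): StandardScaler computes z = (x - mean) / std
prices = np.array([[10.0], [20.0], [30.0]])
scaler = StandardScaler().fit(prices)
standardized = scaler.transform(prices)
# manual computation with population std (ddof = 0), as sklearn uses
manual = (prices - prices.mean()) / prices.std()
print(np.allclose(standardized, manual))  # True
```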
numberOfPoints = 6000;
# Categorical data
categoriesVectorsSub = categoriesVectors[0:numberOfPoints];
subCategoriesVectorsSub = subCategoriesVectors[0:numberOfPoints];
teacherPrefixVectorsSub = teacherPrefixVectors[0:numberOfPoints];
schoolStateVectorsSub = schoolStateVectors[0:numberOfPoints];
projectGradeVectorsSub = projectGradeVectors[0:numberOfPoints];
# Text data
bowEssayModelSub = bowEssayModel[0:numberOfPoints];
bowTitleModelSub = bowTitleModel[0:numberOfPoints];
tfIdfEssayModelSub = tfIdfEssayModel[0:numberOfPoints];
tfIdfTitleModelSub = tfIdfTitleModel[0:numberOfPoints];
word2VecEssaysVectorsSub = word2VecEssaysVectors[0:numberOfPoints];
word2VecTitlesVectorsSub = word2VecTitlesVectors[0:numberOfPoints];
tfIdfWeightedWord2VecEssaysVectorsSub = tfIdfWeightedWord2VecEssaysVectors[0:numberOfPoints];
tfIdfWeightedWord2VecTitlesVectorsSub = tfIdfWeightedWord2VecTitlesVectors[0:numberOfPoints];
# Numerical data
priceStandardizedSub = priceStandardized[0:numberOfPoints];
quantityStandardizedSub = quantityStandardized[0:numberOfPoints];
previouslyPostedStandardizedSub = previouslyPostedStandardized[0:numberOfPoints];
classesDataSub = projectsData['project_is_approved'][0:numberOfPoints].values
classesDataSub.shape
bowTitleAndOthers = hstack((bowTitleModelSub, categoriesVectorsSub, subCategoriesVectorsSub, teacherPrefixVectorsSub, schoolStateVectorsSub, projectGradeVectorsSub, priceStandardizedSub, previouslyPostedStandardizedSub));
bowTitleAndOthers.shape
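`scipy.sparse.hstack` glues the sparse text matrices and the dense numeric columns side by side into one feature matrix without densifying the sparse parts. A minimal sketch with hypothetical features:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy features (hypothetical): 3 sparse text columns + 1 dense numeric column
sparseText = csr_matrix(np.array([[1, 0, 1], [0, 1, 0]]))
densePrice = np.array([[0.5], [-0.5]])
merged = hstack((sparseText, densePrice))
print(merged.shape)  # (2, 4)
```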
perplexityValues = [5, 10, 30, 50, 80, 100]
for perplexityValue in perplexityValues:
    tsne = TSNE(n_components = 2, perplexity = perplexityValue, learning_rate = 200);
    bowTitleAndOthersEmbedded = tsne.fit_transform(bowTitleAndOthers.toarray());
    bowTitleAndOthersTsneData = np.hstack((bowTitleAndOthersEmbedded, classesDataSub.reshape(-1, 1)));
    bowTitleAndOthersTsneDataFrame = pd.DataFrame(bowTitleAndOthersTsneData, columns = ['Dimension1', 'Dimension2', 'Class']);
    colors = {0.0:'red', 1.0:'green'}
    plt.title("TSNE plot for merged data of BoW Title and Categorical, Numerical features - Perplexity({})".format(perplexityValue));
    plt.scatter(bowTitleAndOthersTsneDataFrame['Dimension1'], bowTitleAndOthersTsneDataFrame['Dimension2'], c = bowTitleAndOthersTsneDataFrame['Class'].apply(lambda x: colors[x]));
    plt.show();
tfIdfTitleAndOthers = hstack((tfIdfTitleModelSub, categoriesVectorsSub, subCategoriesVectorsSub, teacherPrefixVectorsSub, schoolStateVectorsSub, projectGradeVectorsSub, priceStandardizedSub, previouslyPostedStandardizedSub));
tfIdfTitleAndOthers.shape
perplexityValues = [5, 10, 30, 50, 80, 100]
for perplexityValue in perplexityValues:
    tsne = TSNE(n_components = 2, perplexity = perplexityValue, learning_rate = 200);
    tfIdfTitleAndOthersEmbedded = tsne.fit_transform(tfIdfTitleAndOthers.toarray());
    tfIdfTitleAndOthersTsneData = np.hstack((tfIdfTitleAndOthersEmbedded, classesDataSub.reshape(-1, 1)));
    tfIdfTitleAndOthersTsneDataFrame = pd.DataFrame(tfIdfTitleAndOthersTsneData, columns = ['Dimension1', 'Dimension2', 'Class']);
    colors = {0.0:'red', 1.0:'green'}
    plt.title("TSNE plot for merged data of Tf-Idf Title and Categorical, Numerical features - Perplexity({})".format(perplexityValue));
    plt.scatter(tfIdfTitleAndOthersTsneDataFrame['Dimension1'], tfIdfTitleAndOthersTsneDataFrame['Dimension2'], c = tfIdfTitleAndOthersTsneDataFrame['Class'].apply(lambda x: colors[x]));
    plt.show();
word2VecTitleAndOthers = hstack((word2VecTitlesVectorsSub, categoriesVectorsSub, subCategoriesVectorsSub, teacherPrefixVectorsSub, schoolStateVectorsSub, projectGradeVectorsSub, priceStandardizedSub, previouslyPostedStandardizedSub));
word2VecTitleAndOthers.shape
perplexityValues = [5, 10, 30, 50, 80, 100]
for perplexityValue in perplexityValues:
    tsne = TSNE(n_components = 2, perplexity = perplexityValue, learning_rate = 200);
    word2VecTitleAndOthersEmbedded = tsne.fit_transform(word2VecTitleAndOthers.toarray());
    word2VecTitleAndOthersTsneData = np.hstack((word2VecTitleAndOthersEmbedded, classesDataSub.reshape(-1, 1)));
    word2VecTitleAndOthersTsneDataFrame = pd.DataFrame(word2VecTitleAndOthersTsneData, columns = ['Dimension1', 'Dimension2', 'Class']);
    colors = {0.0:'red', 1.0:'green'}
    plt.title("TSNE plot for merged data of Average Word2Vec Title and Categorical, Numerical features - Perplexity({})".format(perplexityValue));
    plt.scatter(word2VecTitleAndOthersTsneDataFrame['Dimension1'], word2VecTitleAndOthersTsneDataFrame['Dimension2'], c = word2VecTitleAndOthersTsneDataFrame['Class'].apply(lambda x: colors[x]));
    plt.show();
tfIdfWeightedWord2VecTitleAndOthers = hstack((tfIdfWeightedWord2VecTitlesVectorsSub, categoriesVectorsSub, subCategoriesVectorsSub, teacherPrefixVectorsSub, schoolStateVectorsSub, projectGradeVectorsSub, priceStandardizedSub, previouslyPostedStandardizedSub));
tfIdfWeightedWord2VecTitleAndOthers.shape
perplexityValues = [5, 10, 30, 50, 80, 100]
for perplexityValue in perplexityValues:
    tsne = TSNE(n_components = 2, perplexity = perplexityValue, learning_rate = 200);
    tfIdfWeightedWord2VecTitleAndOthersEmbedded = tsne.fit_transform(tfIdfWeightedWord2VecTitleAndOthers.toarray());
    tfIdfWeightedWord2VecTitleAndOthersTsneData = np.hstack((tfIdfWeightedWord2VecTitleAndOthersEmbedded, classesDataSub.reshape(-1, 1)));
    tfIdfWeightedWord2VecTitleAndOthersTsneDataFrame = pd.DataFrame(tfIdfWeightedWord2VecTitleAndOthersTsneData, columns = ['Dimension1', 'Dimension2', 'Class']);
    colors = {0.0:'red', 1.0:'green'}
    plt.title("TSNE plot for merged data of Tf-Idf Weighted Word2Vec Title and Categorical, Numerical features - Perplexity({})".format(perplexityValue));
    plt.scatter(tfIdfWeightedWord2VecTitleAndOthersTsneDataFrame['Dimension1'], tfIdfWeightedWord2VecTitleAndOthersTsneDataFrame['Dimension2'], c = tfIdfWeightedWord2VecTitleAndOthersTsneDataFrame['Class'].apply(lambda x: colors[x]));
    plt.show();
allFeatures = hstack((bowTitleModelSub, tfIdfTitleModelSub, word2VecTitlesVectorsSub, tfIdfWeightedWord2VecTitlesVectorsSub, categoriesVectorsSub, subCategoriesVectorsSub, teacherPrefixVectorsSub, schoolStateVectorsSub, projectGradeVectorsSub, priceStandardizedSub, previouslyPostedStandardizedSub))
print(allFeatures.shape)
perplexityValues = [5, 10, 30, 50, 80, 100]
for perplexityValue in perplexityValues:
    tsne = TSNE(n_components = 2, perplexity = perplexityValue, learning_rate = 200);
    allFeaturesEmbedded = tsne.fit_transform(allFeatures.toarray());
    allFeaturesTsneData = np.hstack((allFeaturesEmbedded, classesDataSub.reshape(-1, 1)));
    allFeaturesTsneDataFrame = pd.DataFrame(allFeaturesTsneData, columns = ['Dimension1', 'Dimension2', 'Class']);
    colors = {0.0:'red', 1.0:'green'}
    plt.title("TSNE plot for merged data of all vectorized features and Categorical, Numerical features - Perplexity({})".format(perplexityValue));
    plt.scatter(allFeaturesTsneDataFrame['Dimension1'], allFeaturesTsneDataFrame['Dimension2'], c = allFeaturesTsneDataFrame['Class'].apply(lambda x: colors[x]));
    plt.show();
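In all of the loops above, perplexity roughly sets how many neighbours each point balances its attention over, so it must stay below the number of points. A tiny runnable sketch (hypothetical random data) showing the shape contract of `TSNE.fit_transform`:

```python
import numpy as np
from sklearn.manifold import TSNE

# Tiny random dataset (hypothetical): 20 points in 5 dimensions
points = np.random.RandomState(0).rand(20, 5)
# perplexity must be smaller than the number of samples
embedded = TSNE(n_components = 2, perplexity = 5, random_state = 0).fit_transform(points)
print(embedded.shape)  # (20, 2)
```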
projectsData = projectsData.dropna(subset = ['teacher_prefix']);
projectsData.shape
classesData = projectsData['project_is_approved']
print(classesData.shape)
# Note: sklearn.cross_validation was removed in scikit-learn 0.20; newer versions import train_test_split from sklearn.model_selection
trainingAndCrossValidateData, testData, classesTrainingAndCrossValidate, classesTest = cross_validation.train_test_split(projectsData[0:10000], classesData[0:10000], test_size = 0.3, random_state = 0);
trainingData, crossValidateData, classesTraining, classesCrossValidate = cross_validation.train_test_split(trainingAndCrossValidateData, classesTrainingAndCrossValidate, test_size = 0.3, random_state = 0);
print("Shapes of splitted data: ");
equalsBorder(70);
print("trainingAndCrossValidateData shape: ", trainingAndCrossValidateData.shape);
print("classesTrainingAndCrossValidate shape: ", classesTrainingAndCrossValidate.shape);
print("testData shape: ", testData.shape);
print("classesTest: ", classesTest.shape);
print("trainingData shape: ", trainingData.shape);
print("classesTraining shape: ", classesTraining.shape);
print("crossValidateData shape: ", crossValidateData.shape);
print("classesCrossValidate shape: ", classesCrossValidate.shape);
print("Number of negative points: ", trainingData[trainingData['project_is_approved'] == 0].shape);
print("Number of positive points: ", trainingData[trainingData['project_is_approved'] == 1].shape);
negativeData = trainingData[trainingData['project_is_approved'] == 0];
positiveData = trainingData[trainingData['project_is_approved'] == 1];
# Upsampling the negative (minority) class with replacement to balance the two classes
negativeDataBalanced = resample(negativeData, replace = True, n_samples = 8319, random_state = 44);
trainingData = pd.concat([positiveData, negativeDataBalanced]);
trainingData = shuffle(trainingData);
classesTraining = trainingData['project_is_approved'];
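The balancing step above duplicates minority-class rows (sampling with replacement) until both classes have the same size. A self-contained sketch on a toy imbalanced frame (hypothetical labels):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data (hypothetical): 4 positives, 1 negative
data = pd.DataFrame({"project_is_approved": [1, 1, 1, 1, 0]})
negative = data[data["project_is_approved"] == 0]
positive = data[data["project_is_approved"] == 1]
# upsample negatives with replacement to match the positive count
negativeUpsampled = resample(negative, replace = True, n_samples = len(positive), random_state = 44)
balanced = pd.concat([positive, negativeUpsampled])
print(balanced["project_is_approved"].value_counts().to_dict())
```

Upsampling is applied only to the training split; the cross-validation and test splits keep the original class ratio so evaluation stays honest.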
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique cleaned_categories
subjectsCategoriesVectorizer = CountVectorizer(vocabulary = list(sortedCategoriesDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with cleaned_categories values
subjectsCategoriesVectorizer.fit(trainingData['cleaned_categories'].values);
# Vectorizing categories using one-hot-encoding
categoriesVectors = subjectsCategoriesVectorizer.transform(trainingData['cleaned_categories'].values);
print("Features used in vectorizing categories: ");
equalsBorder(70);
print(subjectsCategoriesVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of cleaned_categories matrix after vectorization(one-hot-encoding): ", categoriesVectors.shape);
equalsBorder(70);
print("Sample vectors of categories: ");
equalsBorder(70);
print(categoriesVectors[0:4])
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique cleaned_sub_categories
subjectsSubCategoriesVectorizer = CountVectorizer(vocabulary = list(sortedDictionarySubCategories.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with cleaned_sub_categories values
subjectsSubCategoriesVectorizer.fit(trainingData['cleaned_sub_categories'].values);
# Vectorizing sub categories using one-hot-encoding
subCategoriesVectors = subjectsSubCategoriesVectorizer.transform(trainingData['cleaned_sub_categories'].values);
print("Features used in vectorizing subject sub categories: ");
equalsBorder(70);
print(subjectsSubCategoriesVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of cleaned_categories matrix after vectorization(one-hot-encoding): ", subCategoriesVectors.shape);
equalsBorder(70);
print("Sample vectors of categories: ");
equalsBorder(70);
print(subCategoriesVectors[0:4])
def giveCounter(data):
    counter = Counter();
    for dataValue in data:
        counter.update(str(dataValue).split());
    return counter
giveCounter(trainingData['teacher_prefix'].values)
teacherPrefixDictionary = dict(giveCounter(trainingData['teacher_prefix'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique teacher_prefix
teacherPrefixVectorizer = CountVectorizer(vocabulary = list(teacherPrefixDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with teacher_prefix values
teacherPrefixVectorizer.fit(trainingData['teacher_prefix'].values);
# Vectorizing teacher_prefix using one-hot-encoding
teacherPrefixVectors = teacherPrefixVectorizer.transform(trainingData['teacher_prefix'].values);
print("Features used in vectorizing teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of teacher_prefix matrix after vectorization(one-hot-encoding): ", teacherPrefixVectors.shape);
equalsBorder(70);
print("Sample vectors of teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectors[0:100]);
teacherPrefixes = [prefix.replace('.', '') for prefix in trainingData['teacher_prefix'].values];
teacherPrefixes[0:5]
trainingData['teacher_prefix'] = teacherPrefixes;
trainingData.head(3)
teacherPrefixDictionary = dict(giveCounter(trainingData['teacher_prefix'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique teacher_prefix
teacherPrefixVectorizer = CountVectorizer(vocabulary = list(teacherPrefixDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with teacher_prefix values
teacherPrefixVectorizer.fit(trainingData['teacher_prefix'].values);
# Vectorizing teacher_prefix using one-hot-encoding
teacherPrefixVectors = teacherPrefixVectorizer.transform(trainingData['teacher_prefix'].values);
print("Features used in vectorizing teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of teacher_prefix matrix after vectorization(one-hot-encoding): ", teacherPrefixVectors.shape);
equalsBorder(70);
print("Sample vectors of teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectors[0:4]);
schoolStateDictionary = dict(giveCounter(trainingData['school_state'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique school states
schoolStateVectorizer = CountVectorizer(vocabulary = list(schoolStateDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with school_state values
schoolStateVectorizer.fit(trainingData['school_state'].values);
# Vectorizing school_state using one-hot-encoding
schoolStateVectors = schoolStateVectorizer.transform(trainingData['school_state'].values);
print("Features used in vectorizing school_state: ");
equalsBorder(70);
print(schoolStateVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of school_state matrix after vectorization(one-hot-encoding): ", schoolStateVectors.shape);
equalsBorder(70);
print("Sample vectors of school_state: ");
equalsBorder(70);
print(schoolStateVectors[0:4]);
giveCounter(trainingData['project_grade_category'])
cleanedGrades = []
for grade in trainingData['project_grade_category'].values:
    grade = grade.replace(' ', '');
    grade = grade.replace('-', 'to');
    cleanedGrades.append(grade);
cleanedGrades[0:4]
trainingData['project_grade_category'] = cleanedGrades
trainingData.head(4)
projectGradeDictionary = dict(giveCounter(trainingData['project_grade_category'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique project grade categories
projectGradeVectorizer = CountVectorizer(vocabulary = list(projectGradeDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with project_grade_category values
projectGradeVectorizer.fit(trainingData['project_grade_category'].values);
# Vectorizing project_grade_category using one-hot-encoding
projectGradeVectors = projectGradeVectorizer.transform(trainingData['project_grade_category'].values);
print("Features used in vectorizing project_grade_category: ");
equalsBorder(70);
print(projectGradeVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of school_state matrix after vectorization(one-hot-encoding): ", projectGradeVectors.shape);
equalsBorder(70);
print("Sample vectors of school_state: ");
equalsBorder(70);
print(projectGradeVectors[0:4]);
preProcessedEssaysWithStopWords, preProcessedEssaysWithoutStopWords = preProcessingWithAndWithoutStopWords(trainingData['project_essay']);
preProcessedProjectTitlesWithStopWords, preProcessedProjectTitlesWithoutStopWords = preProcessingWithAndWithoutStopWords(trainingData['project_title']);
# Initializing countvectorizer for bag of words vectorization of preprocessed project essays
bowEssayVectorizer = CountVectorizer(min_df = 10);
# Transforming the preprocessed essays to bag of words vectors
bowEssayModel = bowEssayVectorizer.fit_transform(preProcessedEssaysWithoutStopWords);
print("Some of the Features used in vectorizing preprocessed essays: ");
equalsBorder(70);
print(bowEssayVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed essay matrix after vectorization: ", bowEssayModel.shape);
equalsBorder(70);
print("Sample bag-of-words vector of preprocessed essay: ");
equalsBorder(70);
print(bowEssayModel[0])
# Initializing countvectorizer for bag of words vectorization of preprocessed project titles
bowTitleVectorizer = CountVectorizer(min_df = 10);
# Transforming the preprocessed project titles to bag of words vectors
bowTitleModel = bowTitleVectorizer.fit_transform(preProcessedProjectTitlesWithoutStopWords);
print("Some of the Features used in vectorizing preprocessed titles: ");
equalsBorder(70);
print(bowTitleVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after vectorization: ", bowTitleModel.shape);
equalsBorder(70);
print("Sample bag-of-words vector of preprocessed title: ");
equalsBorder(70);
print(bowTitleModel[0])
# Intializing tfidf vectorizer for tf-idf vectorization of preprocessed project essays
tfIdfEssayVectorizer = TfidfVectorizer(min_df = 10);
# Transforming the preprocessed project essays to tf-idf vectors
tfIdfEssayModel = tfIdfEssayVectorizer.fit_transform(preProcessedEssaysWithoutStopWords);
print("Some of the Features used in tf-idf vectorizing preprocessed essays: ");
equalsBorder(70);
print(tfIdfEssayVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed essay matrix after tf-idf vectorization: ", tfIdfEssayModel.shape);
equalsBorder(70);
print("Sample Tf-Idf vector of preprocessed essay: ");
equalsBorder(70);
print(tfIdfEssayModel[0])
# Initializing tfidf vectorizer for tf-idf vectorization of preprocessed project titles
tfIdfTitleVectorizer = TfidfVectorizer(min_df = 10);
# Transforming the preprocessed project titles to tf-idf vectors
tfIdfTitleModel = tfIdfTitleVectorizer.fit_transform(preProcessedProjectTitlesWithoutStopWords);
print("Some of the Features used in tf-idf vectorizing preprocessed titles: ");
equalsBorder(70);
print(tfIdfTitleVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after tf-idf vectorization: ", tfIdfTitleModel.shape);
equalsBorder(70);
print("Sample Tf-Idf vector of preprocessed title: ");
equalsBorder(70);
print(tfIdfTitleModel[0])
# Storing variables in pickle files in Python: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/
# The pretrained glove_vectors pickle file must be available to build the model below
with open('glove_vectors', 'rb') as f:
gloveModel = pickle.load(f)
gloveWords = set(gloveModel.keys())
print("Glove vector of sample word: ");
equalsBorder(70);
print(gloveModel['technology']);
equalsBorder(70);
print("Shape of glove vector: ", gloveModel['technology'].shape);
def getWord2VecVectors(texts):
word2VecTextsVectors = [];
for preProcessedText in tqdm(texts):
word2VecTextVector = np.zeros(300);
numberOfWordsInText = 0;
for word in preProcessedText.split():
if word in gloveWords:
word2VecTextVector += gloveModel[word];
numberOfWordsInText += 1;
if numberOfWordsInText != 0:
word2VecTextVector = word2VecTextVector / numberOfWordsInText;
word2VecTextsVectors.append(word2VecTextVector);
return word2VecTextsVectors;
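A minimal sketch of the averaging logic in `getWord2VecVectors`, using a tiny hypothetical 3-dimensional "glove" dictionary instead of the real 300-dimensional vectors: vectors of in-vocabulary words are summed and divided by the number of words found, and out-of-vocabulary words are skipped.

```python
import numpy as np

# Hypothetical 3-d word vectors (illustration only)
toyGlove = {'science': np.array([1.0, 0.0, 0.0]),
            'students': np.array([0.0, 2.0, 0.0])};
toyGloveWords = set(toyGlove.keys());

def toyAvgVector(text, dim = 3):
    vector = np.zeros(dim);
    count = 0;
    for word in text.split():
        if word in toyGloveWords:  # out-of-vocabulary words are skipped
            vector += toyGlove[word];
            count += 1;
    return vector / count if count else vector;

print(toyAvgVector("science students zzz"));  # [0.5 1.  0. ]
```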
word2VecEssaysVectors = getWord2VecVectors(preProcessedEssaysWithoutStopWords);
print("Shape of Word2Vec vectorization matrix of essays: {},{}".format(len(word2VecEssaysVectors), len(word2VecEssaysVectors[0])));
equalsBorder(70);
print("Sample essay: ");
equalsBorder(70);
print(preProcessedEssaysWithoutStopWords[0]);
equalsBorder(70);
print("Word2Vec vector of sample essay: ");
equalsBorder(70);
print(word2VecEssaysVectors[0]);
word2VecTitlesVectors = getWord2VecVectors(preProcessedProjectTitlesWithoutStopWords);
print("Shape of Word2Vec vectorization matrix of project titles: {}, {}".format(len(word2VecTitlesVectors), len(word2VecTitlesVectors[0])));
equalsBorder(70);
print("Sample title: ");
equalsBorder(70);
print(preProcessedProjectTitlesWithoutStopWords[0]);
equalsBorder(70);
print("Word2Vec vector of sample title: ");
equalsBorder(70);
print(word2VecTitlesVectors[0]);
# Initializing tfidf vectorizer
tfIdfEssayTempVectorizer = TfidfVectorizer();
# Vectorizing preprocessed essays using tfidf vectorizer initialized above
tfIdfEssayTempVectorizer.fit(preProcessedEssaysWithoutStopWords);
# Saving dictionary in which each word is a key and its idf is the value
tfIdfEssayDictionary = dict(zip(tfIdfEssayTempVectorizer.get_feature_names(), list(tfIdfEssayTempVectorizer.idf_)));
# Creating set of all unique words used by tfidf vectorizer
tfIdfEssayWords = set(tfIdfEssayTempVectorizer.get_feature_names());
# Creating list to save tf-idf weighted vectors of essays
tfIdfWeightedWord2VecEssaysVectors = [];
# Iterating over each essay
for essay in tqdm(preProcessedEssaysWithoutStopWords):
# Sum of tf-idf values of all words in a particular essay
cumulativeSumTfIdfWeightOfEssay = 0;
# Tf-Idf weighted word2vec vector of a particular essay
tfIdfWeightedWord2VecEssayVector = np.zeros(300);
# Splitting essay into list of words
splittedEssay = essay.split();
# Iterating over each word
for word in splittedEssay:
# Checking if word is in glove words and set of words used by tfIdf essay vectorizer
if (word in gloveWords) and (word in tfIdfEssayWords):
# Tf-Idf value of particular word in essay: idf times term frequency (counted over the token list, since str.count would also match substrings)
tfIdfValueWord = tfIdfEssayDictionary[word] * (splittedEssay.count(word) / len(splittedEssay));
# Making tf-idf weighted word2vec
tfIdfWeightedWord2VecEssayVector += tfIdfValueWord * gloveModel[word];
# Summing tf-idf weight of word to cumulative sum
cumulativeSumTfIdfWeightOfEssay += tfIdfValueWord;
if cumulativeSumTfIdfWeightOfEssay != 0:
# Taking average of sum of vectors with tf-idf cumulative sum
tfIdfWeightedWord2VecEssayVector = tfIdfWeightedWord2VecEssayVector / cumulativeSumTfIdfWeightOfEssay;
# Appending the above calculated tf-idf weighted vector of particular essay to list of vectors of essays
tfIdfWeightedWord2VecEssaysVectors.append(tfIdfWeightedWord2VecEssayVector);
print("Shape of Tf-Idf weighted Word2Vec vectorization matrix of project essays: {}, {}".format(len(tfIdfWeightedWord2VecEssaysVectors), len(tfIdfWeightedWord2VecEssaysVectors[0])));
equalsBorder(70);
print("Sample Essay: ");
equalsBorder(70);
print(preProcessedEssaysWithoutStopWords[0]);
equalsBorder(70);
print("Tf-Idf Weighted Word2Vec vector of sample essay: ");
equalsBorder(70);
print(tfIdfWeightedWord2VecEssaysVectors[0]);
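The weighting scheme above can be checked on hypothetical numbers: each word vector is scaled by its tf-idf value and the sum is divided by the cumulative weight, so rarer (higher tf-idf) words pull the average toward themselves.

```python
import numpy as np

# Hypothetical 2-d word vectors and tf-idf values (illustration only)
toyVectors = {'books': np.array([1.0, 0.0]), 'the': np.array([0.0, 1.0])};
toyTfIdf = {'books': 3.0, 'the': 0.5};
weightedSum = sum(toyTfIdf[word] * toyVectors[word] for word in ['books', 'the']);
totalWeight = sum(toyTfIdf[word] for word in ['books', 'the']);
toyWeightedAverage = weightedSum / totalWeight;
print(toyWeightedAverage);  # roughly [0.857, 0.143] -- dominated by 'books'
```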
# Initializing tfidf vectorizer
tfIdfTitleTempVectorizer = TfidfVectorizer();
# Vectorizing preprocessed titles using tfidf vectorizer initialized above
tfIdfTitleTempVectorizer.fit(preProcessedProjectTitlesWithoutStopWords);
# Saving dictionary in which each word is a key and its idf is the value
tfIdfTitleDictionary = dict(zip(tfIdfTitleTempVectorizer.get_feature_names(), list(tfIdfTitleTempVectorizer.idf_)));
# Creating set of all unique words used by tfidf vectorizer
tfIdfTitleWords = set(tfIdfTitleTempVectorizer.get_feature_names());
# Creating list to save tf-idf weighted vectors of project titles
tfIdfWeightedWord2VecTitlesVectors = [];
# Iterating over each title
for title in tqdm(preProcessedProjectTitlesWithoutStopWords):
# Sum of tf-idf values of all words in a particular project title
cumulativeSumTfIdfWeightOfTitle = 0;
# Tf-Idf weighted word2vec vector of a particular project title
tfIdfWeightedWord2VecTitleVector = np.zeros(300);
# Splitting title into list of words
splittedTitle = title.split();
# Iterating over each word
for word in splittedTitle:
# Checking if word is in glove words and set of words used by tfIdf title vectorizer
if (word in gloveWords) and (word in tfIdfTitleWords):
# Tf-Idf value of particular word in title: idf times term frequency (counted over the token list, since str.count would also match substrings)
tfIdfValueWord = tfIdfTitleDictionary[word] * (splittedTitle.count(word) / len(splittedTitle));
# Making tf-idf weighted word2vec
tfIdfWeightedWord2VecTitleVector += tfIdfValueWord * gloveModel[word];
# Summing tf-idf weight of word to cumulative sum
cumulativeSumTfIdfWeightOfTitle += tfIdfValueWord;
if cumulativeSumTfIdfWeightOfTitle != 0:
# Taking average of sum of vectors with tf-idf cumulative sum
tfIdfWeightedWord2VecTitleVector = tfIdfWeightedWord2VecTitleVector / cumulativeSumTfIdfWeightOfTitle;
# Appending the above calculated tf-idf weighted vector of particular title to list of vectors of project titles
tfIdfWeightedWord2VecTitlesVectors.append(tfIdfWeightedWord2VecTitleVector);
print("Shape of Tf-Idf weighted Word2Vec vectorization matrix of project titles: {}, {}".format(len(tfIdfWeightedWord2VecTitlesVectors), len(tfIdfWeightedWord2VecTitlesVectors[0])));
equalsBorder(70);
print("Sample Title: ");
equalsBorder(70);
print(preProcessedProjectTitlesWithoutStopWords[0]);
equalsBorder(70);
print("Tf-Idf Weighted Word2Vec vector of sample title: ");
equalsBorder(70);
print(tfIdfWeightedWord2VecTitlesVectors[0]);
# Standardizing the price data using StandardScaler(Uses mean and std for standardization)
priceScaler = StandardScaler();
priceScaler.fit(trainingData['price'].values.reshape(-1, 1));
priceStandardized = priceScaler.transform(trainingData['price'].values.reshape(-1, 1));
print("Shape of standardized matrix of prices: ", priceStandardized.shape);
equalsBorder(70);
print("Sample original prices: ");
equalsBorder(70);
print(trainingData['price'].values[0:5]);
print("Sample standardized prices: ");
equalsBorder(70);
print(priceStandardized[0:5]);
# Standardizing the quantity data using StandardScaler(Uses mean and std for standardization)
quantityScaler = StandardScaler();
quantityScaler.fit(trainingData['quantity'].values.reshape(-1, 1));
quantityStandardized = quantityScaler.transform(trainingData['quantity'].values.reshape(-1, 1));
print("Shape of standardized matrix of quantities: ", quantityStandardized.shape);
equalsBorder(70);
print("Sample original quantities: ");
equalsBorder(70);
print(trainingData['quantity'].values[0:5]);
print("Sample standardized quantities: ");
equalsBorder(70);
print(quantityStandardized[0:5]);
# Standardizing the teacher_number_of_previously_posted_projects data using StandardScaler(Uses mean and std for standardization)
previouslyPostedScaler = StandardScaler();
previouslyPostedScaler.fit(trainingData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
previouslyPostedStandardized = previouslyPostedScaler.transform(trainingData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
print("Shape of standardized matrix of teacher_number_of_previously_posted_projects: ", previouslyPostedStandardized.shape);
equalsBorder(70);
print("Sample original teacher_number_of_previously_posted_projects: ");
equalsBorder(70);
print(trainingData['teacher_number_of_previously_posted_projects'].values[0:5]);
print("Sample standardized teacher_number_of_previously_posted_projects: ");
equalsBorder(70);
print(previouslyPostedStandardized[0:5]);
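StandardScaler computes (x - mean) / std. A minimal check on hypothetical prices that the transform matches the manual formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy prices (illustration only)
toyPrices = np.array([[10.0], [20.0], [30.0]]);
toyScaler = StandardScaler();
toyScaler.fit(toyPrices);
manual = (toyPrices - toyPrices.mean()) / toyPrices.std();
print(np.allclose(toyScaler.transform(toyPrices), manual));  # True
```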
# Restricting each feature set to a subset of points to keep k-NN computation tractable
numberOfPoints = 16638;
# Categorical data
categoriesVectorsSub = categoriesVectors[0:numberOfPoints];
subCategoriesVectorsSub = subCategoriesVectors[0:numberOfPoints];
teacherPrefixVectorsSub = teacherPrefixVectors[0:numberOfPoints];
schoolStateVectorsSub = schoolStateVectors[0:numberOfPoints];
projectGradeVectorsSub = projectGradeVectors[0:numberOfPoints];
# Text data
bowEssayModelSub = bowEssayModel[0:numberOfPoints];
bowTitleModelSub = bowTitleModel[0:numberOfPoints];
tfIdfEssayModelSub = tfIdfEssayModel[0:numberOfPoints];
tfIdfTitleModelSub = tfIdfTitleModel[0:numberOfPoints];
word2VecEssaysVectorsSub = word2VecEssaysVectors[0:numberOfPoints];
word2VecTitlesVectorsSub = word2VecTitlesVectors[0:numberOfPoints];
tfIdfWeightedWord2VecEssaysVectorsSub = tfIdfWeightedWord2VecEssaysVectors[0:numberOfPoints];
tfIdfWeightedWord2VecTitlesVectorsSub = tfIdfWeightedWord2VecTitlesVectors[0:numberOfPoints];
# Numerical data
priceStandardizedSub = priceStandardized[0:numberOfPoints];
quantityStandardizedSub = quantityStandardized[0:numberOfPoints];
previouslyPostedStandardizedSub = previouslyPostedStandardized[0:numberOfPoints];
def getAvgTfIdfEssayVectors(arrayOfTexts):
# Creating list to save tf-idf weighted vectors of essays
tfIdfWeightedWord2VecEssaysVectors = [];
# Iterating over each essay
for essay in tqdm(arrayOfTexts):
# Sum of tf-idf values of all words in a particular essay
cumulativeSumTfIdfWeightOfEssay = 0;
# Tf-Idf weighted word2vec vector of a particular essay
tfIdfWeightedWord2VecEssayVector = np.zeros(300);
# Splitting essay into list of words
splittedEssay = essay.split();
# Iterating over each word
for word in splittedEssay:
# Checking if word is in glove words and set of words used by tfIdf essay vectorizer
if (word in gloveWords) and (word in tfIdfEssayWords):
# Tf-Idf value of particular word in essay: idf times term frequency (counted over the token list, since str.count would also match substrings)
tfIdfValueWord = tfIdfEssayDictionary[word] * (splittedEssay.count(word) / len(splittedEssay));
# Making tf-idf weighted word2vec
tfIdfWeightedWord2VecEssayVector += tfIdfValueWord * gloveModel[word];
# Summing tf-idf weight of word to cumulative sum
cumulativeSumTfIdfWeightOfEssay += tfIdfValueWord;
if cumulativeSumTfIdfWeightOfEssay != 0:
# Taking average of sum of vectors with tf-idf cumulative sum
tfIdfWeightedWord2VecEssayVector = tfIdfWeightedWord2VecEssayVector / cumulativeSumTfIdfWeightOfEssay;
# Appending the above calculated tf-idf weighted vector of particular essay to list of vectors of essays
tfIdfWeightedWord2VecEssaysVectors.append(tfIdfWeightedWord2VecEssayVector);
return tfIdfWeightedWord2VecEssaysVectors;
def getAvgTfIdfTitleVectors(arrayOfTexts):
# Creating list to save tf-idf weighted vectors of project titles
tfIdfWeightedWord2VecTitlesVectors = [];
# Iterating over each title
for title in tqdm(arrayOfTexts):
# Sum of tf-idf values of all words in a particular project title
cumulativeSumTfIdfWeightOfTitle = 0;
# Tf-Idf weighted word2vec vector of a particular project title
tfIdfWeightedWord2VecTitleVector = np.zeros(300);
# Splitting title into list of words
splittedTitle = title.split();
# Iterating over each word
for word in splittedTitle:
# Checking if word is in glove words and set of words used by tfIdf title vectorizer
if (word in gloveWords) and (word in tfIdfTitleWords):
# Tf-Idf value of particular word in title: idf times term frequency (counted over the token list, since str.count would also match substrings)
tfIdfValueWord = tfIdfTitleDictionary[word] * (splittedTitle.count(word) / len(splittedTitle));
# Making tf-idf weighted word2vec
tfIdfWeightedWord2VecTitleVector += tfIdfValueWord * gloveModel[word];
# Summing tf-idf weight of word to cumulative sum
cumulativeSumTfIdfWeightOfTitle += tfIdfValueWord;
if cumulativeSumTfIdfWeightOfTitle != 0:
# Taking average of sum of vectors with tf-idf cumulative sum
tfIdfWeightedWord2VecTitleVector = tfIdfWeightedWord2VecTitleVector / cumulativeSumTfIdfWeightOfTitle;
# Appending the above calculated tf-idf weighted vector of particular title to list of vectors of project titles
tfIdfWeightedWord2VecTitlesVectors.append(tfIdfWeightedWord2VecTitleVector);
return tfIdfWeightedWord2VecTitlesVectors;
resultsDataFrame = pd.DataFrame(columns = ['Vectorizer', 'Model', 'Hyper Parameter - K', 'AUC']);
resultsDataFrame
# Cross-validate data categorical features transformation
categoriesTransformedCrossValidateData = subjectsCategoriesVectorizer.transform(crossValidateData['cleaned_categories']);
subCategoriesTransformedCrossValidateData = subjectsSubCategoriesVectorizer.transform(crossValidateData['cleaned_sub_categories']);
teacherPrefixTransformedCrossValidateData = teacherPrefixVectorizer.transform(crossValidateData['teacher_prefix']);
schoolStateTransformedCrossValidateData = schoolStateVectorizer.transform(crossValidateData['school_state']);
projectGradeTransformedCrossValidateData = projectGradeVectorizer.transform(crossValidateData['project_grade_category']);
# Cross-validate data text features transformation
preProcessedEssaysTemp = preProcessingWithAndWithoutStopWords(crossValidateData['project_essay'])[1];
preProcessedTitlesTemp = preProcessingWithAndWithoutStopWords(crossValidateData['project_title'])[1];
bowEssayTransformedCrossValidateData = bowEssayVectorizer.transform(preProcessedEssaysTemp);
bowTitleTransformedCrossValidateData = bowTitleVectorizer.transform(preProcessedTitlesTemp);
tfIdfEssayTransformedCrossValidateData = tfIdfEssayVectorizer.transform(preProcessedEssaysTemp);
tfIdfTitleTransformedCrossValidateData = tfIdfTitleVectorizer.transform(preProcessedTitlesTemp);
word2VecEssayTransformedCrossValidateData = getWord2VecVectors(preProcessedEssaysTemp);
word2VecTitleTransformedCrossValidateData = getWord2VecVectors(preProcessedTitlesTemp);
tfIdfWeightedEssayTransformedCrossValidateData = getAvgTfIdfEssayVectors(preProcessedEssaysTemp);
tfIdfWeightedTitleTransformedCrossValidateData = getAvgTfIdfTitleVectors(preProcessedTitlesTemp);
# Cross-validate numerical features transformation
priceTransformedCrossValidateData = priceScaler.transform(crossValidateData['price'].values.reshape(-1, 1));
quantityTransformedCrossValidateData = quantityScaler.transform(crossValidateData['quantity'].values.reshape(-1, 1));
previouslyPostedTransformedCrossValidateData = previouslyPostedScaler.transform(crossValidateData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
# Test data categorical features transformation
categoriesTransformedTestData = subjectsCategoriesVectorizer.transform(testData['cleaned_categories']);
subCategoriesTransformedTestData = subjectsSubCategoriesVectorizer.transform(testData['cleaned_sub_categories']);
teacherPrefixTransformedTestData = teacherPrefixVectorizer.transform(testData['teacher_prefix']);
schoolStateTransformedTestData = schoolStateVectorizer.transform(testData['school_state']);
projectGradeTransformedTestData = projectGradeVectorizer.transform(testData['project_grade_category']);
# Test data text features transformation
preProcessedEssaysTemp = preProcessingWithAndWithoutStopWords(testData['project_essay'])[1];
preProcessedTitlesTemp = preProcessingWithAndWithoutStopWords(testData['project_title'])[1];
bowEssayTransformedTestData = bowEssayVectorizer.transform(preProcessedEssaysTemp);
bowTitleTransformedTestData = bowTitleVectorizer.transform(preProcessedTitlesTemp);
tfIdfEssayTransformedTestData = tfIdfEssayVectorizer.transform(preProcessedEssaysTemp);
tfIdfTitleTransformedTestData = tfIdfTitleVectorizer.transform(preProcessedTitlesTemp);
word2VecEssayTransformedTestData = getWord2VecVectors(preProcessedEssaysTemp);
word2VecTitleTransformedTestData = getWord2VecVectors(preProcessedTitlesTemp);
tfIdfWeightedEssayTransformedTestData = getAvgTfIdfEssayVectors(preProcessedEssaysTemp);
tfIdfWeightedTitleTransformedTestData = getAvgTfIdfTitleVectors(preProcessedTitlesTemp);
# Test data numerical features transformation
priceTransformedTestData = priceScaler.transform(testData['price'].values.reshape(-1, 1));
quantityTransformedTestData = quantityScaler.transform(testData['quantity'].values.reshape(-1, 1));
previouslyPostedTransformedTestData = previouslyPostedScaler.transform(testData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
testKValues = np.arange(1, 40, 2);
techniques = ['Bag of Words', 'Tf-Idf', 'Average Word2Vec', 'Tf-Idf Weighted Word2Vec'];
for index, technique in enumerate(techniques):
areaUnderRocValuesTrain = [];
areaUnderRocValuesCrossValidate = [];
trainingMergedData = hstack((categoriesVectorsSub,\
subCategoriesVectorsSub,\
teacherPrefixVectorsSub,\
schoolStateVectorsSub,\
projectGradeVectorsSub,\
priceStandardizedSub,\
previouslyPostedStandardizedSub));
crossValidateMergedData = hstack((categoriesTransformedCrossValidateData,\
subCategoriesTransformedCrossValidateData,\
teacherPrefixTransformedCrossValidateData,\
schoolStateTransformedCrossValidateData,\
projectGradeTransformedCrossValidateData,\
priceTransformedCrossValidateData,\
previouslyPostedTransformedCrossValidateData));
testMergedData = hstack((categoriesTransformedTestData,\
subCategoriesTransformedTestData,\
teacherPrefixTransformedTestData,\
schoolStateTransformedTestData,\
projectGradeTransformedTestData,\
priceTransformedTestData,\
previouslyPostedTransformedTestData));
if(index == 0):
trainingMergedData = hstack((trainingMergedData,\
bowTitleModelSub,\
bowEssayModelSub));
crossValidateMergedData = hstack((crossValidateMergedData,\
bowTitleTransformedCrossValidateData,\
bowEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
bowTitleTransformedTestData,\
bowEssayTransformedTestData));
elif(index == 1):
trainingMergedData = hstack((trainingMergedData,\
tfIdfTitleModelSub,\
tfIdfEssayModelSub));
crossValidateMergedData = hstack((crossValidateMergedData,\
tfIdfTitleTransformedCrossValidateData,\
tfIdfEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
tfIdfTitleTransformedTestData,\
tfIdfEssayTransformedTestData));
elif(index == 2):
trainingMergedData = hstack((trainingMergedData,\
word2VecTitlesVectorsSub,\
word2VecEssaysVectorsSub));
crossValidateMergedData = hstack((crossValidateMergedData,\
word2VecTitleTransformedCrossValidateData,\
word2VecEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
word2VecTitleTransformedTestData,\
word2VecEssayTransformedTestData));
elif(index == 3):
trainingMergedData = hstack((trainingMergedData,\
tfIdfWeightedWord2VecTitlesVectorsSub,\
tfIdfWeightedWord2VecEssaysVectorsSub));
crossValidateMergedData = hstack((crossValidateMergedData,\
tfIdfWeightedTitleTransformedCrossValidateData,\
tfIdfWeightedEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
tfIdfWeightedTitleTransformedTestData,\
tfIdfWeightedEssayTransformedTestData));
for testKValue in tqdm(testKValues):
knnClassifier = KNeighborsClassifier(n_neighbors = testKValue, algorithm = 'brute');
knnClassifier.fit(trainingMergedData, classesTraining);
predProbScores = knnClassifier.predict_proba(trainingMergedData);
fpr, tpr, threshold = roc_curve(classesTraining, predProbScores[:, 1]);
areaUnderRocValuesTrain.append(auc(fpr, tpr));
predProbScores = knnClassifier.predict_proba(crossValidateMergedData);
fpr, tpr, threshold = roc_curve(classesCrossValidate, predProbScores[:, 1]);
areaUnderRocValuesCrossValidate.append(auc(fpr, tpr));
plt.plot(testKValues, areaUnderRocValuesTrain, 'r', label = "Training K vs AUC");
plt.plot(testKValues, areaUnderRocValuesCrossValidate, 'b', label = "Cross Validate K vs AUC");
plt.title("Training & Cross-validate K vs AUC - {} vectorized text".format(technique));
plt.xlabel("Hyper parameter - K");
plt.ylabel("Area under curve - AUC");
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show();
optimalKValue = testKValues[np.argmax(areaUnderRocValuesCrossValidate)];
knnClassifier = KNeighborsClassifier(n_neighbors = optimalKValue, algorithm = 'brute');
knnClassifier.fit(trainingMergedData, classesTraining);
predProbScoresCrossValidate = knnClassifier.predict_proba(crossValidateMergedData);
fprCrossValidate, tprCrossValidate, thresholdCrossValidate = roc_curve(classesCrossValidate, predProbScoresCrossValidate[:, 1]);
predProbScoresTest = knnClassifier.predict_proba(testMergedData);
fprTest, tprTest, thresholdTest = roc_curve(classesTest, predProbScoresTest[:, 1]);
areaUnderRocValueTest = auc(fprTest, tprTest);
plt.plot(fprCrossValidate, tprCrossValidate, 'y', label="Cross-validate ROC curve - {} vectorized text".format(technique));
plt.plot(fprTest, tprTest, 'g', label="Test ROC curve - {} vectorized text".format(technique));
plt.plot([0, 1], [0, 1], 'k-');
plt.title("ROC curves for cross-validate and test data using k-value {}".format(optimalKValue))
plt.xlabel('False positive rate - FPR');
plt.ylabel('True positive rate - TPR');
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show();
print("Results of analysis using {} vectorized text features merged with other features using K-NN brute force algorithm:".format(technique));
equalsBorder(70);
print("AUC values of cross-validate data: ");
equalsBorder(40);
print(areaUnderRocValuesCrossValidate);
equalsBorder(40);
print("Optimal K-Value: ", optimalKValue);
equalsBorder(40);
print("AUC value of test data: ", areaUnderRocValueTest);
# Predicting classes of test data projects
predictionClassesTest = knnClassifier.predict(testMergedData);
equalsBorder(40);
# Adding results to results dataframe
resultsDataFrame = resultsDataFrame.append({'Vectorizer': technique, 'Model': 'Brute', 'Hyper Parameter - K': optimalKValue, 'AUC': areaUnderRocValueTest}, ignore_index = True);
# Printing confusion matrix
confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
# Creating dataframe for generated confusion matrix
confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
print("Confusion Matrix : ");
equalsBorder(60);
print(confusionMatrixDataFrame);
equalsBorder(110);
equalsBorder(110);
equalsBorder(110);
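The K selection inside the loop above reduces to two pieces: computing AUC from the ROC curve, and taking the K with the highest cross-validate AUC via `np.argmax`. A minimal sketch on hypothetical labels, scores, and AUC values:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical labels and scores: a classifier that ranks all positives
# above all negatives gets a perfect AUC of 1.0
toyLabels = np.array([0, 0, 1, 1]);
toyScores = np.array([0.1, 0.2, 0.8, 0.9]);
fpr, tpr, thresholds = roc_curve(toyLabels, toyScores);
print(auc(fpr, tpr));  # 1.0

# Optimal K is the candidate with the highest cross-validate AUC (toy values)
toyKValues = np.array([1, 3, 5]);
toyCrossValidateAucs = [0.60, 0.72, 0.70];
print(toyKValues[np.argmax(toyCrossValidateAucs)]);  # 3
```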
testKValues = np.arange(1, 40, 2);
techniques = ['Bag of Words', 'Tf-Idf', 'Average Word2Vec', 'Tf-Idf Weighted Word2Vec'];
for index, technique in enumerate(techniques):
areaUnderRocValuesTrain = [];
areaUnderRocValuesCrossValidate = [];
trainingMergedData = hstack((categoriesVectorsSub,\
subCategoriesVectorsSub,\
teacherPrefixVectorsSub,\
schoolStateVectorsSub,\
projectGradeVectorsSub,\
priceStandardizedSub,\
previouslyPostedStandardizedSub));
crossValidateMergedData = hstack((categoriesTransformedCrossValidateData,\
subCategoriesTransformedCrossValidateData,\
teacherPrefixTransformedCrossValidateData,\
schoolStateTransformedCrossValidateData,\
projectGradeTransformedCrossValidateData,\
priceTransformedCrossValidateData,\
previouslyPostedTransformedCrossValidateData));
testMergedData = hstack((categoriesTransformedTestData,\
subCategoriesTransformedTestData,\
teacherPrefixTransformedTestData,\
schoolStateTransformedTestData,\
projectGradeTransformedTestData,\
priceTransformedTestData,\
previouslyPostedTransformedTestData));
if(index == 0):
trainingMergedData = hstack((trainingMergedData,\
bowTitleModelSub,\
bowEssayModelSub));
crossValidateMergedData = hstack((crossValidateMergedData,\
bowTitleTransformedCrossValidateData,\
bowEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
bowTitleTransformedTestData,\
bowEssayTransformedTestData));
elif(index == 1):
trainingMergedData = hstack((trainingMergedData,\
tfIdfTitleModelSub,\
tfIdfEssayModelSub));
crossValidateMergedData = hstack((crossValidateMergedData,\
tfIdfTitleTransformedCrossValidateData,\
tfIdfEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
tfIdfTitleTransformedTestData,\
tfIdfEssayTransformedTestData));
elif(index == 2):
trainingMergedData = hstack((trainingMergedData,\
word2VecTitlesVectorsSub,\
word2VecEssaysVectorsSub));
crossValidateMergedData = hstack((crossValidateMergedData,\
word2VecTitleTransformedCrossValidateData,\
word2VecEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
word2VecTitleTransformedTestData,\
word2VecEssayTransformedTestData));
elif(index == 3):
trainingMergedData = hstack((trainingMergedData,\
tfIdfWeightedWord2VecTitlesVectorsSub,\
tfIdfWeightedWord2VecEssaysVectorsSub));
crossValidateMergedData = hstack((crossValidateMergedData,\
tfIdfWeightedTitleTransformedCrossValidateData,\
tfIdfWeightedEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
tfIdfWeightedTitleTransformedTestData,\
tfIdfWeightedEssayTransformedTestData));
for testKValue in tqdm(testKValues):
knnClassifier = KNeighborsClassifier(n_neighbors = testKValue, algorithm = 'brute');
knnClassifier.fit(trainingMergedData, classesTraining);
predProbScores = knnClassifier.predict_proba(trainingMergedData);
fpr, tpr, threshold = roc_curve(classesTraining, predProbScores[:, 1]);
areaUnderRocValuesTrain.append(auc(fpr, tpr));
predProbScores = knnClassifier.predict_proba(crossValidateMergedData);
fpr, tpr, threshold = roc_curve(classesCrossValidate, predProbScores[:, 1]);
areaUnderRocValuesCrossValidate.append(auc(fpr, tpr));
plt.plot(testKValues, areaUnderRocValuesTrain, 'r', label = "Training K vs AUC");
plt.plot(testKValues, areaUnderRocValuesCrossValidate, 'b', label = "Cross Validate K vs AUC");
plt.title("Training & Cross-validate K vs AUC - {} vectorized text".format(technique));
plt.xlabel("Hyper parameter - K");
plt.ylabel("Area under curve - AUC");
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show();
optimalKValue = testKValues[np.argmax(areaUnderRocValuesCrossValidate)];
knnClassifier = KNeighborsClassifier(n_neighbors = optimalKValue, algorithm = 'brute');
knnClassifier.fit(trainingMergedData, classesTraining);
predProbScoresCrossValidate = knnClassifier.predict_proba(crossValidateMergedData);
fprCrossValidate, tprCrossValidate, thresholdCrossValidate = roc_curve(classesCrossValidate, predProbScoresCrossValidate[:, 1]);
predProbScoresTest = knnClassifier.predict_proba(testMergedData);
fprTest, tprTest, thresholdTest = roc_curve(classesTest, predProbScoresTest[:, 1]);
areaUnderRocValueTest = auc(fprTest, tprTest);
plt.plot(fprCrossValidate, tprCrossValidate, 'y', label="Cross-validate ROC curve - {} vectorized text".format(technique));
plt.plot(fprTest, tprTest, 'g', label="Test ROC curve - {} vectorized text".format(technique));
plt.plot([0, 1], [0, 1], 'k-');
plt.title("ROC curves for cross-validate and test data using k-value {}".format(optimalKValue))
plt.xlabel('False positive rate - FPR');
plt.ylabel('True positive rate - TPR');
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show();
print("Results of analysis using {} vectorized text features merged with other features using K-NN brute force algorithm:".format(technique));
equalsBorder(70);
print("AUC values of cross-validate data: ");
equalsBorder(40);
print(areaUnderRocValuesCrossValidate);
equalsBorder(40);
print("Optimal K-Value: ", optimalKValue);
equalsBorder(40);
print("AUC value of test data: ", areaUnderRocValueTest);
# Predicting classes of test data projects
predictionClassesTest = knnClassifier.predict(testMergedData);
equalsBorder(40);
# Adding results to results dataframe
balancedDataResultsDataFrame = balancedDataResultsDataFrame.append({'Vectorizer': technique, 'Model': 'Brute', 'Hyper Parameter - K': optimalKValue, 'AUC': areaUnderRocValueTest}, ignore_index = True);
# Printing confusion matrix
confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
# Creating dataframe for generated confusion matrix
confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
print("Confusion Matrix : ");
equalsBorder(60);
print(confusionMatrixDataFrame);
equalsBorder(110);
equalsBorder(110);
equalsBorder(110);
projectsData = projectsData.dropna(subset = ['teacher_prefix']);
projectsData.shape
classesData = projectsData['project_is_approved']
print(classesData.shape)
trainingData, testData, classesTraining, classesTest = cross_validation.train_test_split(projectsData[0:10000], classesData[0:10000], test_size = 0.3, random_state = 0);
print("Shapes of split data: ");
equalsBorder(70);
print("testData shape: ", testData.shape);
print("classesTest: ", classesTest.shape);
print("trainingData shape: ", trainingData.shape);
print("classesTraining shape: ", classesTraining.shape);
print("Number of negative points: ", trainingData[trainingData['project_is_approved'] == 0].shape[0]);
print("Number of positive points: ", trainingData[trainingData['project_is_approved'] == 1].shape[0]);
negativeData = trainingData[trainingData['project_is_approved'] == 0];
positiveData = trainingData[trainingData['project_is_approved'] == 1];
negativeDataBalanced = resample(negativeData, replace = True, n_samples = 5957, random_state = 44);
trainingData = pd.concat([positiveData, negativeDataBalanced]);
trainingData = shuffle(trainingData);
classesTraining = trainingData['project_is_approved'];
print("Testing whether data is balanced: ");
equalsBorder(60);
print("Number of positive points: ", trainingData[trainingData['project_is_approved'] == 1].shape[0]);
print("Number of negative points: ", trainingData[trainingData['project_is_approved'] == 0].shape[0]);
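The upsampling step above can be sketched on a hypothetical imbalanced frame: the minority class is resampled with replacement until it matches the majority-class count (here 4 rows, versus the hardcoded 5957 used on the real data).

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: four positives, one negative
toyData = pd.DataFrame({'label': [1, 1, 1, 1, 0]});
toyMinority = toyData[toyData['label'] == 0];
toyMajority = toyData[toyData['label'] == 1];
# Upsampling minority class with replacement to the majority-class size
toyMinorityUp = resample(toyMinority, replace = True, n_samples = len(toyMajority), random_state = 0);
toyBalanced = pd.concat([toyMajority, toyMinorityUp]);
print(toyBalanced['label'].value_counts());  # 4 of each class
```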
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique cleaned_categories
subjectsCategoriesVectorizer = CountVectorizer(vocabulary = list(sortedCategoriesDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with cleaned_categories values
subjectsCategoriesVectorizer.fit(trainingData['cleaned_categories'].values);
# Vectorizing categories using one-hot-encoding
categoriesVectors = subjectsCategoriesVectorizer.transform(trainingData['cleaned_categories'].values);
print("Features used in vectorizing categories: ");
equalsBorder(70);
print(subjectsCategoriesVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of cleaned_categories matrix after vectorization(one-hot-encoding): ", categoriesVectors.shape);
equalsBorder(70);
print("Sample vectors of categories: ");
equalsBorder(70);
print(categoriesVectors[0:4])
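One-hot encoding via CountVectorizer with a fixed vocabulary can be sketched on hypothetical category labels: `binary = True` caps every count at 1, so each row becomes a presence indicator over the supplied vocabulary rather than a frequency count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical category vocabulary (illustration only)
toyVocabulary = ['Math_Science', 'Literacy_Language', 'Health_Sports'];
toyOneHot = CountVectorizer(vocabulary = toyVocabulary, lowercase = False, binary = True);
toyMatrix = toyOneHot.transform(["Math_Science Literacy_Language", "Health_Sports"]);
print(toyMatrix.toarray());
# [[1 1 0]
#  [0 0 1]]
```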
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique cleaned_sub_categories
subjectsSubCategoriesVectorizer = CountVectorizer(vocabulary = list(sortedDictionarySubCategories.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with cleaned_sub_categories values
subjectsSubCategoriesVectorizer.fit(trainingData['cleaned_sub_categories'].values);
# Vectorizing sub categories using one-hot-encoding
subCategoriesVectors = subjectsSubCategoriesVectorizer.transform(trainingData['cleaned_sub_categories'].values);
print("Features used in vectorizing subject sub categories: ");
equalsBorder(70);
print(subjectsSubCategoriesVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of cleaned_sub_categories matrix after vectorization(one-hot-encoding): ", subCategoriesVectors.shape);
equalsBorder(70);
print("Sample vectors of sub categories: ");
equalsBorder(70);
print(subCategoriesVectors[0:4])
def giveCounter(data):
    counter = Counter();
    for dataValue in data:
        counter.update(str(dataValue).split());
    return counter
giveCounter(trainingData['teacher_prefix'].values)
teacherPrefixDictionary = dict(giveCounter(trainingData['teacher_prefix'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique teacher_prefix
teacherPrefixVectorizer = CountVectorizer(vocabulary = list(teacherPrefixDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with teacher_prefix values
teacherPrefixVectorizer.fit(trainingData['teacher_prefix'].values);
# Vectorizing teacher_prefix using one-hot-encoding
teacherPrefixVectors = teacherPrefixVectorizer.transform(trainingData['teacher_prefix'].values);
print("Features used in vectorizing teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of teacher_prefix matrix after vectorization(one-hot-encoding): ", teacherPrefixVectors.shape);
equalsBorder(70);
print("Sample vectors of teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectors[0:100]);
teacherPrefixes = [prefix.replace('.', '') for prefix in trainingData['teacher_prefix'].values];
teacherPrefixes[0:5]
trainingData['teacher_prefix'] = teacherPrefixes;
trainingData.head(3)
teacherPrefixDictionary = dict(giveCounter(trainingData['teacher_prefix'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique teacher_prefix
teacherPrefixVectorizer = CountVectorizer(vocabulary = list(teacherPrefixDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with teacher_prefix values
teacherPrefixVectorizer.fit(trainingData['teacher_prefix'].values);
# Vectorizing teacher_prefix using one-hot-encoding
teacherPrefixVectors = teacherPrefixVectorizer.transform(trainingData['teacher_prefix'].values);
print("Features used in vectorizing teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of teacher_prefix matrix after vectorization(one-hot-encoding): ", teacherPrefixVectors.shape);
equalsBorder(70);
print("Sample vectors of teacher_prefix: ");
equalsBorder(70);
print(teacherPrefixVectors[0:4]);
schoolStateDictionary = dict(giveCounter(trainingData['school_state'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique school states
schoolStateVectorizer = CountVectorizer(vocabulary = list(schoolStateDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with school_state values
schoolStateVectorizer.fit(trainingData['school_state'].values);
# Vectorizing school_state using one-hot-encoding
schoolStateVectors = schoolStateVectorizer.transform(trainingData['school_state'].values);
print("Features used in vectorizing school_state: ");
equalsBorder(70);
print(schoolStateVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of school_state matrix after vectorization(one-hot-encoding): ", schoolStateVectors.shape);
equalsBorder(70);
print("Sample vectors of school_state: ");
equalsBorder(70);
print(schoolStateVectors[0:4]);
giveCounter(trainingData['project_grade_category'])
cleanedGrades = []
for grade in trainingData['project_grade_category'].values:
    grade = grade.replace(' ', '');
    grade = grade.replace('-', 'to');
    cleanedGrades.append(grade);
cleanedGrades[0:4]
trainingData['project_grade_category'] = cleanedGrades
trainingData.head(4)
projectGradeDictionary = dict(giveCounter(trainingData['project_grade_category'].values));
# Using CountVectorizer for performing one-hot-encoding by setting vocabulary as list of all unique project grade categories
projectGradeVectorizer = CountVectorizer(vocabulary = list(projectGradeDictionary.keys()), lowercase = False, binary = True);
# Fitting CountVectorizer with project_grade_category values
projectGradeVectorizer.fit(trainingData['project_grade_category'].values);
# Vectorizing project_grade_category using one-hot-encoding
projectGradeVectors = projectGradeVectorizer.transform(trainingData['project_grade_category'].values);
print("Features used in vectorizing project_grade_category: ");
equalsBorder(70);
print(projectGradeVectorizer.get_feature_names());
equalsBorder(70);
print("Shape of project_grade_category matrix after vectorization(one-hot-encoding): ", projectGradeVectors.shape);
equalsBorder(70);
print("Sample vectors of project_grade_category: ");
equalsBorder(70);
print(projectGradeVectors[0:4]);
preProcessedEssaysWithStopWords, preProcessedEssaysWithoutStopWords = preProcessingWithAndWithoutStopWords(trainingData['project_essay']);
preProcessedProjectTitlesWithStopWords, preProcessedProjectTitlesWithoutStopWords = preProcessingWithAndWithoutStopWords(trainingData['project_title']);
# Initializing countvectorizer for bag of words vectorization of preprocessed project essays
bowEssayVectorizer = CountVectorizer(min_df = 10);
# Transforming the preprocessed essays to bag of words vectors
bowEssayModel = bowEssayVectorizer.fit_transform(preProcessedEssaysWithoutStopWords);
print("Some of the Features used in vectorizing preprocessed essays: ");
equalsBorder(70);
print(bowEssayVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed essay matrix after vectorization: ", bowEssayModel.shape);
equalsBorder(70);
print("Sample bag-of-words vector of preprocessed essay: ");
equalsBorder(70);
print(bowEssayModel[0])
# Initializing countvectorizer for bag of words vectorization of preprocessed project titles
bowTitleVectorizer = CountVectorizer(min_df = 10);
# Transforming the preprocessed project titles to bag of words vectors
bowTitleModel = bowTitleVectorizer.fit_transform(preProcessedProjectTitlesWithoutStopWords);
print("Some of the Features used in vectorizing preprocessed titles: ");
equalsBorder(70);
print(bowTitleVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after vectorization: ", bowTitleModel.shape);
equalsBorder(70);
print("Sample bag-of-words vector of preprocessed title: ");
equalsBorder(70);
print(bowTitleModel[0])
# Initializing tfidf vectorizer for tf-idf vectorization of preprocessed project essays
tfIdfEssayVectorizer = TfidfVectorizer(min_df = 10);
# Transforming the preprocessed project essays to tf-idf vectors
tfIdfEssayModel = tfIdfEssayVectorizer.fit_transform(preProcessedEssaysWithoutStopWords);
print("Some of the Features used in tf-idf vectorizing preprocessed essays: ");
equalsBorder(70);
print(tfIdfEssayVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after tf-idf vectorization: ", tfIdfEssayModel.shape);
equalsBorder(70);
print("Sample Tf-Idf vector of preprocessed essay: ");
equalsBorder(70);
print(tfIdfEssayModel[0])
# Initializing tfidf vectorizer for tf-idf vectorization of preprocessed project titles
tfIdfTitleVectorizer = TfidfVectorizer(min_df = 10);
# Transforming the preprocessed project titles to tf-idf vectors
tfIdfTitleModel = tfIdfTitleVectorizer.fit_transform(preProcessedProjectTitlesWithoutStopWords);
print("Some of the Features used in tf-idf vectorizing preprocessed titles: ");
equalsBorder(70);
print(tfIdfTitleVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after tf-idf vectorization: ", tfIdfTitleModel.shape);
equalsBorder(70);
print("Sample Tf-Idf vector of preprocessed title: ");
equalsBorder(70);
print(tfIdfTitleModel[0])
# Storing variables into pickle files in python: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/
# The glove_vectors pickle file is required to build the model below
with open('glove_vectors', 'rb') as f:
    gloveModel = pickle.load(f)
gloveWords = set(gloveModel.keys())
print("Glove vector of sample word: ");
equalsBorder(70);
print(gloveModel['technology']);
equalsBorder(70);
print("Shape of glove vector: ", gloveModel['technology'].shape);
def getWord2VecVectors(texts):
    word2VecTextsVectors = [];
    for preProcessedText in tqdm(texts):
        word2VecTextVector = np.zeros(300);
        numberOfWordsInText = 0;
        for word in preProcessedText.split():
            if word in gloveWords:
                word2VecTextVector += gloveModel[word];
                numberOfWordsInText += 1;
        if numberOfWordsInText != 0:
            word2VecTextVector = word2VecTextVector / numberOfWordsInText;
        word2VecTextsVectors.append(word2VecTextVector);
    return word2VecTextsVectors;
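The averaging in getWord2VecVectors can be checked on a toy vocabulary; the 3-dimensional "glove" vectors below are made-up values, not real GloVe embeddings:

```python
import numpy as np

# Toy 3-d "glove" vectors (hypothetical values) to illustrate the averaging above
toyGlove = {'school': np.array([1.0, 0.0, 2.0]), 'books': np.array([3.0, 2.0, 0.0])}
toyWords = set(toyGlove.keys())

vector = np.zeros(3)
count = 0
for word in 'school books pencils'.split():  # 'pencils' is out-of-vocabulary and skipped
    if word in toyWords:
        vector += toyGlove[word]
        count += 1
vector = vector / count if count != 0 else vector
print(vector)  # [2. 1. 1.]
```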
word2VecEssaysVectors = getWord2VecVectors(preProcessedEssaysWithoutStopWords);
print("Shape of Word2Vec vectorization matrix of essays: {},{}".format(len(word2VecEssaysVectors), len(word2VecEssaysVectors[0])));
equalsBorder(70);
print("Sample essay: ");
equalsBorder(70);
print(preProcessedEssaysWithoutStopWords[0]);
equalsBorder(70);
print("Word2Vec vector of sample essay: ");
equalsBorder(70);
print(word2VecEssaysVectors[0]);
word2VecTitlesVectors = getWord2VecVectors(preProcessedProjectTitlesWithoutStopWords);
print("Shape of Word2Vec vectorization matrix of project titles: {}, {}".format(len(word2VecTitlesVectors), len(word2VecTitlesVectors[0])));
equalsBorder(70);
print("Sample title: ");
equalsBorder(70);
print(preProcessedProjectTitlesWithoutStopWords[0]);
equalsBorder(70);
print("Word2Vec vector of sample title: ");
equalsBorder(70);
print(word2VecTitlesVectors[0]);
# Initializing tfidf vectorizer
tfIdfEssayTempVectorizer = TfidfVectorizer();
# Vectorizing preprocessed essays using tfidf vectorizer initialized above
tfIdfEssayTempVectorizer.fit(preProcessedEssaysWithoutStopWords);
# Saving dictionary in which each word is key and it's idf is value
tfIdfEssayDictionary = dict(zip(tfIdfEssayTempVectorizer.get_feature_names(), list(tfIdfEssayTempVectorizer.idf_)));
# Creating set of all unique words used by tfidf vectorizer
tfIdfEssayWords = set(tfIdfEssayTempVectorizer.get_feature_names());
# Creating list to save tf-idf weighted vectors of essays
tfIdfWeightedWord2VecEssaysVectors = [];
# Iterating over each essay
for essay in tqdm(preProcessedEssaysWithoutStopWords):
    # Sum of tf-idf values of all words in a particular essay
    cumulativeSumTfIdfWeightOfEssay = 0;
    # Tf-Idf weighted word2vec vector of a particular essay
    tfIdfWeightedWord2VecEssayVector = np.zeros(300);
    # Splitting essay into list of words
    splittedEssay = essay.split();
    # Iterating over each word
    for word in splittedEssay:
        # Checking if word is in glove words and set of words used by tfIdf essay vectorizer
        if (word in gloveWords) and (word in tfIdfEssayWords):
            # Tf-Idf value of particular word in essay: idf * term frequency
            # (counting occurrences in the word list, since str.count would also match substrings)
            tfIdfValueWord = tfIdfEssayDictionary[word] * (splittedEssay.count(word) / len(splittedEssay));
            # Making tf-idf weighted word2vec
            tfIdfWeightedWord2VecEssayVector += tfIdfValueWord * gloveModel[word];
            # Summing tf-idf weight of word to cumulative sum
            cumulativeSumTfIdfWeightOfEssay += tfIdfValueWord;
    if cumulativeSumTfIdfWeightOfEssay != 0:
        # Taking average of sum of vectors with tf-idf cumulative sum
        tfIdfWeightedWord2VecEssayVector = tfIdfWeightedWord2VecEssayVector / cumulativeSumTfIdfWeightOfEssay;
    # Appending the above calculated tf-idf weighted vector of particular essay to list of vectors of essays
    tfIdfWeightedWord2VecEssaysVectors.append(tfIdfWeightedWord2VecEssayVector);
print("Shape of Tf-Idf weighted Word2Vec vectorization matrix of project essays: {}, {}".format(len(tfIdfWeightedWord2VecEssaysVectors), len(tfIdfWeightedWord2VecEssaysVectors[0])));
equalsBorder(70);
print("Sample Essay: ");
equalsBorder(70);
print(preProcessedEssaysWithoutStopWords[0]);
equalsBorder(70);
print("Tf-Idf Weighted Word2Vec vector of sample essay: ");
equalsBorder(70);
print(tfIdfWeightedWord2VecEssaysVectors[0]);
# Initializing tfidf vectorizer
tfIdfTitleTempVectorizer = TfidfVectorizer();
# Vectorizing preprocessed titles using tfidf vectorizer initialized above
tfIdfTitleTempVectorizer.fit(preProcessedProjectTitlesWithoutStopWords);
# Saving dictionary in which each word is key and it's idf is value
tfIdfTitleDictionary = dict(zip(tfIdfTitleTempVectorizer.get_feature_names(), list(tfIdfTitleTempVectorizer.idf_)));
# Creating set of all unique words used by tfidf vectorizer
tfIdfTitleWords = set(tfIdfTitleTempVectorizer.get_feature_names());
# Creating list to save tf-idf weighted vectors of project titles
tfIdfWeightedWord2VecTitlesVectors = [];
# Iterating over each title
for title in tqdm(preProcessedProjectTitlesWithoutStopWords):
    # Sum of tf-idf values of all words in a particular project title
    cumulativeSumTfIdfWeightOfTitle = 0;
    # Tf-Idf weighted word2vec vector of a particular project title
    tfIdfWeightedWord2VecTitleVector = np.zeros(300);
    # Splitting title into list of words
    splittedTitle = title.split();
    # Iterating over each word
    for word in splittedTitle:
        # Checking if word is in glove words and set of words used by tfIdf title vectorizer
        if (word in gloveWords) and (word in tfIdfTitleWords):
            # Tf-Idf value of particular word in title: idf * term frequency
            # (counting occurrences in the word list, since str.count would also match substrings)
            tfIdfValueWord = tfIdfTitleDictionary[word] * (splittedTitle.count(word) / len(splittedTitle));
            # Making tf-idf weighted word2vec
            tfIdfWeightedWord2VecTitleVector += tfIdfValueWord * gloveModel[word];
            # Summing tf-idf weight of word to cumulative sum
            cumulativeSumTfIdfWeightOfTitle += tfIdfValueWord;
    if cumulativeSumTfIdfWeightOfTitle != 0:
        # Taking average of sum of vectors with tf-idf cumulative sum
        tfIdfWeightedWord2VecTitleVector = tfIdfWeightedWord2VecTitleVector / cumulativeSumTfIdfWeightOfTitle;
    # Appending the above calculated tf-idf weighted vector of particular title to list of vectors of project titles
    tfIdfWeightedWord2VecTitlesVectors.append(tfIdfWeightedWord2VecTitleVector);
print("Shape of Tf-Idf weighted Word2Vec vectorization matrix of project titles: {}, {}".format(len(tfIdfWeightedWord2VecTitlesVectors), len(tfIdfWeightedWord2VecTitlesVectors[0])));
equalsBorder(70);
print("Sample Title: ");
equalsBorder(70);
print(preProcessedProjectTitlesWithoutStopWords[0]);
equalsBorder(70);
print("Tf-Idf Weighted Word2Vec vector of sample title: ");
equalsBorder(70);
print(tfIdfWeightedWord2VecTitlesVectors[0]);
# Standardizing the price data using StandardScaler(Uses mean and std for standardization)
priceScaler = StandardScaler();
priceScaler.fit(trainingData['price'].values.reshape(-1, 1));
priceStandardized = priceScaler.transform(trainingData['price'].values.reshape(-1, 1));
print("Shape of standardized matrix of prices: ", priceStandardized.shape);
equalsBorder(70);
print("Sample original prices: ");
equalsBorder(70);
print(trainingData['price'].values[0:5]);
print("Sample standardized prices: ");
equalsBorder(70);
print(priceStandardized[0:5]);
# Standardizing the quantity data using StandardScaler(Uses mean and std for standardization)
quantityScaler = StandardScaler();
quantityScaler.fit(trainingData['quantity'].values.reshape(-1, 1));
quantityStandardized = quantityScaler.transform(trainingData['quantity'].values.reshape(-1, 1));
print("Shape of standardized matrix of quantities: ", quantityStandardized.shape);
equalsBorder(70);
print("Sample original quantities: ");
equalsBorder(70);
print(trainingData['quantity'].values[0:5]);
print("Sample standardized quantities: ");
equalsBorder(70);
print(quantityStandardized[0:5]);
# Standardizing the teacher_number_of_previously_posted_projects data using StandardScaler(Uses mean and std for standardization)
previouslyPostedScaler = StandardScaler();
previouslyPostedScaler.fit(trainingData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
previouslyPostedStandardized = previouslyPostedScaler.transform(trainingData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
print("Shape of standardized matrix of teacher_number_of_previously_posted_projects: ", previouslyPostedStandardized.shape);
equalsBorder(70);
print("Sample original teacher_number_of_previously_posted_projects values: ");
equalsBorder(70);
print(trainingData['teacher_number_of_previously_posted_projects'].values[0:5]);
print("Sample standardized teacher_number_of_previously_posted_projects: ");
equalsBorder(70);
print(previouslyPostedStandardized[0:5]);
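All three numerical features above use the same mean/std standardization; a toy sketch (the prices are made-up values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy prices standardized to zero mean and unit variance
toyPrices = np.array([100.0, 200.0, 300.0]).reshape(-1, 1)
scaler = StandardScaler()
standardized = scaler.fit_transform(toyPrices)
print(standardized.ravel())  # [-1.22474487  0.          1.22474487]
```

The `reshape(-1, 1)` is needed because StandardScaler expects a 2-d column of samples, not a flat array.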
numberOfPoints = 16638;
# Categorical data
categoriesVectorsSub = categoriesVectors[0:numberOfPoints];
subCategoriesVectorsSub = subCategoriesVectors[0:numberOfPoints];
teacherPrefixVectorsSub = teacherPrefixVectors[0:numberOfPoints];
schoolStateVectorsSub = schoolStateVectors[0:numberOfPoints];
projectGradeVectorsSub = projectGradeVectors[0:numberOfPoints];
# Text data
bowEssayModelSub = bowEssayModel[0:numberOfPoints];
bowTitleModelSub = bowTitleModel[0:numberOfPoints];
tfIdfEssayModelSub = tfIdfEssayModel[0:numberOfPoints];
tfIdfTitleModelSub = tfIdfTitleModel[0:numberOfPoints];
word2VecEssaysVectorsSub = word2VecEssaysVectors[0:numberOfPoints];
word2VecTitlesVectorsSub = word2VecTitlesVectors[0:numberOfPoints];
tfIdfWeightedWord2VecEssaysVectorsSub = tfIdfWeightedWord2VecEssaysVectors[0:numberOfPoints];
tfIdfWeightedWord2VecTitlesVectorsSub = tfIdfWeightedWord2VecTitlesVectors[0:numberOfPoints];
# Numerical data
priceStandardizedSub = priceStandardized[0:numberOfPoints];
quantityStandardizedSub = quantityStandardized[0:numberOfPoints];
previouslyPostedStandardizedSub = previouslyPostedStandardized[0:numberOfPoints];
def getAvgTfIdfEssayVectors(arrayOfTexts):
    # Creating list to save tf-idf weighted vectors of essays
    tfIdfWeightedWord2VecEssaysVectors = [];
    # Iterating over each essay
    for essay in tqdm(arrayOfTexts):
        # Sum of tf-idf values of all words in a particular essay
        cumulativeSumTfIdfWeightOfEssay = 0;
        # Tf-Idf weighted word2vec vector of a particular essay
        tfIdfWeightedWord2VecEssayVector = np.zeros(300);
        # Splitting essay into list of words
        splittedEssay = essay.split();
        # Iterating over each word
        for word in splittedEssay:
            # Checking if word is in glove words and set of words used by tfIdf essay vectorizer
            if (word in gloveWords) and (word in tfIdfEssayWords):
                # Tf-Idf value of particular word in essay: idf * term frequency
                # (counting occurrences in the word list, since str.count would also match substrings)
                tfIdfValueWord = tfIdfEssayDictionary[word] * (splittedEssay.count(word) / len(splittedEssay));
                # Making tf-idf weighted word2vec
                tfIdfWeightedWord2VecEssayVector += tfIdfValueWord * gloveModel[word];
                # Summing tf-idf weight of word to cumulative sum
                cumulativeSumTfIdfWeightOfEssay += tfIdfValueWord;
        if cumulativeSumTfIdfWeightOfEssay != 0:
            # Taking average of sum of vectors with tf-idf cumulative sum
            tfIdfWeightedWord2VecEssayVector = tfIdfWeightedWord2VecEssayVector / cumulativeSumTfIdfWeightOfEssay;
        # Appending the above calculated tf-idf weighted vector of particular essay to list of vectors of essays
        tfIdfWeightedWord2VecEssaysVectors.append(tfIdfWeightedWord2VecEssayVector);
    return tfIdfWeightedWord2VecEssaysVectors;
def getAvgTfIdfTitleVectors(arrayOfTexts):
    # Creating list to save tf-idf weighted vectors of project titles
    tfIdfWeightedWord2VecTitlesVectors = [];
    # Iterating over each title
    for title in tqdm(arrayOfTexts):
        # Sum of tf-idf values of all words in a particular project title
        cumulativeSumTfIdfWeightOfTitle = 0;
        # Tf-Idf weighted word2vec vector of a particular project title
        tfIdfWeightedWord2VecTitleVector = np.zeros(300);
        # Splitting title into list of words
        splittedTitle = title.split();
        # Iterating over each word
        for word in splittedTitle:
            # Checking if word is in glove words and set of words used by tfIdf title vectorizer
            if (word in gloveWords) and (word in tfIdfTitleWords):
                # Tf-Idf value of particular word in title: idf * term frequency
                # (counting occurrences in the word list, since str.count would also match substrings)
                tfIdfValueWord = tfIdfTitleDictionary[word] * (splittedTitle.count(word) / len(splittedTitle));
                # Making tf-idf weighted word2vec
                tfIdfWeightedWord2VecTitleVector += tfIdfValueWord * gloveModel[word];
                # Summing tf-idf weight of word to cumulative sum
                cumulativeSumTfIdfWeightOfTitle += tfIdfValueWord;
        if cumulativeSumTfIdfWeightOfTitle != 0:
            # Taking average of sum of vectors with tf-idf cumulative sum
            tfIdfWeightedWord2VecTitleVector = tfIdfWeightedWord2VecTitleVector / cumulativeSumTfIdfWeightOfTitle;
        # Appending the above calculated tf-idf weighted vector of particular title to list of vectors of project titles
        tfIdfWeightedWord2VecTitlesVectors.append(tfIdfWeightedWord2VecTitleVector);
    return tfIdfWeightedWord2VecTitlesVectors;
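The tf-idf weighting used by both helpers can be sanity-checked on a toy two-word text with made-up idf values and 2-dimensional vectors:

```python
import numpy as np

# Toy example of tf-idf weighted averaging (hypothetical idf values and embeddings)
idf = {'science': 2.0, 'fair': 4.0}
glove = {'science': np.array([1.0, 0.0]), 'fair': np.array([0.0, 1.0])}
words = 'science fair'.split()

weighted = np.zeros(2)
total = 0.0
for w in words:
    tfidf = idf[w] * (words.count(w) / len(words))  # tf * idf
    weighted += tfidf * glove[w]
    total += tfidf
weighted /= total
print(weighted)  # [0.33333333 0.66666667]
```

The rarer word ('fair', higher idf) pulls the averaged vector toward its embedding, which is the point of tf-idf weighting over plain averaging.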
kFoldResultsDataFrame = pd.DataFrame(columns = ['Vectorizer', 'Model', 'Hyper Parameter - K', 'AUC']);
kFoldResultsDataFrame
# Test data categorical features transformation
categoriesTransformedTestData = subjectsCategoriesVectorizer.transform(testData['cleaned_categories']);
subCategoriesTransformedTestData = subjectsSubCategoriesVectorizer.transform(testData['cleaned_sub_categories']);
teacherPrefixTransformedTestData = teacherPrefixVectorizer.transform(testData['teacher_prefix']);
schoolStateTransformedTestData = schoolStateVectorizer.transform(testData['school_state']);
projectGradeTransformedTestData = projectGradeVectorizer.transform(testData['project_grade_category']);
# Test data text features transformation
preProcessedEssaysTemp = preProcessingWithAndWithoutStopWords(testData['project_essay'])[1];
preProcessedTitlesTemp = preProcessingWithAndWithoutStopWords(testData['project_title'])[1];
bowEssayTransformedTestData = bowEssayVectorizer.transform(preProcessedEssaysTemp);
bowTitleTransformedTestData = bowTitleVectorizer.transform(preProcessedTitlesTemp);
tfIdfEssayTransformedTestData = tfIdfEssayVectorizer.transform(preProcessedEssaysTemp);
tfIdfTitleTransformedTestData = tfIdfTitleVectorizer.transform(preProcessedTitlesTemp);
word2VecEssayTransformedTestData = getWord2VecVectors(preProcessedEssaysTemp);
word2VecTitleTransformedTestData = getWord2VecVectors(preProcessedTitlesTemp);
tfIdfWeightedEssayTransformedTestData = getAvgTfIdfEssayVectors(preProcessedEssaysTemp);
tfIdfWeightedTitleTransformedTestData = getAvgTfIdfTitleVectors(preProcessedTitlesTemp);
# Test data numerical features transformation
priceTransformedTestData = priceScaler.transform(testData['price'].values.reshape(-1, 1));
quantityTransformedTestData = quantityScaler.transform(testData['quantity'].values.reshape(-1, 1));
previouslyPostedTransformedTestData = previouslyPostedScaler.transform(testData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
testKValues = np.arange(1, 40, 2);
techniques = ['Bag of Words', 'Tf-Idf', 'Average Word2Vec', 'Tf-Idf Weighted Word2Vec'];
for index, technique in enumerate(techniques):
    areaUnderRocValuesTrain = [];
    trainingMergedData = hstack((categoriesVectorsSub,\
                                 subCategoriesVectorsSub,\
                                 teacherPrefixVectorsSub,\
                                 schoolStateVectorsSub,\
                                 projectGradeVectorsSub,\
                                 priceStandardizedSub,\
                                 previouslyPostedStandardizedSub));
    testMergedData = hstack((categoriesTransformedTestData,\
                             subCategoriesTransformedTestData,\
                             teacherPrefixTransformedTestData,\
                             schoolStateTransformedTestData,\
                             projectGradeTransformedTestData,\
                             priceTransformedTestData,\
                             previouslyPostedTransformedTestData));
    if(index == 0):
        trainingMergedData = hstack((trainingMergedData,\
                                     bowTitleModelSub,\
                                     bowEssayModelSub));
        testMergedData = hstack((testMergedData,\
                                 bowTitleTransformedTestData,\
                                 bowEssayTransformedTestData));
    elif(index == 1):
        trainingMergedData = hstack((trainingMergedData,\
                                     tfIdfTitleModelSub,\
                                     tfIdfEssayModelSub));
        testMergedData = hstack((testMergedData,\
                                 tfIdfTitleTransformedTestData,\
                                 tfIdfEssayTransformedTestData));
    elif(index == 2):
        trainingMergedData = hstack((trainingMergedData,\
                                     word2VecTitlesVectorsSub,\
                                     word2VecEssaysVectorsSub));
        testMergedData = hstack((testMergedData,\
                                 word2VecTitleTransformedTestData,\
                                 word2VecEssayTransformedTestData));
    elif(index == 3):
        trainingMergedData = hstack((trainingMergedData,\
                                     tfIdfWeightedWord2VecTitlesVectorsSub,\
                                     tfIdfWeightedWord2VecEssaysVectorsSub));
        testMergedData = hstack((testMergedData,\
                                 tfIdfWeightedTitleTransformedTestData,\
                                 tfIdfWeightedEssayTransformedTestData));
    for testKValue in tqdm(testKValues):
        knnClassifier = KNeighborsClassifier(n_neighbors = testKValue, algorithm = 'brute');
        scores = cross_val_score(knnClassifier, trainingMergedData, classesTraining, cv = 5, scoring = 'roc_auc');
        areaUnderRocValuesTrain.append(np.array(scores).mean());
    plt.plot(testKValues, areaUnderRocValuesTrain, 'r', label = "Training K vs AUC");
    plt.title("Training Data K vs AUC - {} vectorized text".format(technique));
    plt.xlabel("Hyper parameter - K");
    plt.ylabel("Area under curve - AUC");
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.show();
    optimalKValue = testKValues[np.argmax(areaUnderRocValuesTrain)];
    knnClassifier = KNeighborsClassifier(n_neighbors = optimalKValue, algorithm = 'brute');
    knnClassifier.fit(trainingMergedData, classesTraining);
    predProbScoresTraining = knnClassifier.predict_proba(trainingMergedData);
    fprTrain, tprTrain, thresholdTrain = roc_curve(classesTraining, predProbScoresTraining[:, 1]);
    predProbScoresTest = knnClassifier.predict_proba(testMergedData);
    fprTest, tprTest, thresholdTest = roc_curve(classesTest, predProbScoresTest[:, 1]);
    areaUnderRocValueTest = auc(fprTest, tprTest);
    plt.plot(fprTrain, tprTrain, 'y', label="Train ROC curve - {} vectorized text".format(technique));
    plt.plot(fprTest, tprTest, 'g', label="Test ROC curve - {} vectorized text".format(technique));
    plt.plot([0, 1], [0, 1], 'k-');
    plt.title("ROC Curves for train and test data using k-value {}".format(optimalKValue))
    plt.xlabel('False positive rate - FPR');
    plt.ylabel('True positive rate - TPR');
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.show();
    print("Results of analysis using {} vectorized text features merged with other features using K-NN brute force algorithm:".format(technique));
    equalsBorder(70);
    print("AUC values of train data: ");
    equalsBorder(40);
    print(areaUnderRocValuesTrain);
    equalsBorder(40);
    print("Optimal K-Value: ", optimalKValue);
    equalsBorder(40);
    print("AUC value of test data: ", areaUnderRocValueTest);
    # Predicting classes of test data projects
    predictionClassesTest = knnClassifier.predict(testMergedData);
    equalsBorder(40);
    # Adding results to results dataframe
    kFoldResultsDataFrame = kFoldResultsDataFrame.append({'Vectorizer': technique, 'Model': 'Brute', 'Hyper Parameter - K': optimalKValue, 'AUC': areaUnderRocValueTest}, ignore_index = True);
    # Printing confusion matrix
    confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
    # Creating dataframe for generated confusion matrix
    confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
    print("Confusion Matrix : ");
    equalsBorder(60);
    print(confusionMatrixDataFrame);
    equalsBorder(110);
    equalsBorder(110);
    equalsBorder(110);
testKValues = np.arange(1, 40, 2);
techniques = ['Bag of Words', 'Tf-Idf', 'Average Word2Vec', 'Tf-Idf Weighted Word2Vec'];
for index, technique in enumerate(techniques):
    areaUnderRocValuesTrain = [];
    trainingMergedData = hstack((categoriesVectorsSub,\
                                 subCategoriesVectorsSub,\
                                 teacherPrefixVectorsSub,\
                                 schoolStateVectorsSub,\
                                 projectGradeVectorsSub,\
                                 priceStandardizedSub,\
                                 previouslyPostedStandardizedSub));
    testMergedData = hstack((categoriesTransformedTestData,\
                             subCategoriesTransformedTestData,\
                             teacherPrefixTransformedTestData,\
                             schoolStateTransformedTestData,\
                             projectGradeTransformedTestData,\
                             priceTransformedTestData,\
                             previouslyPostedTransformedTestData));
    if(index == 0):
        trainingMergedData = hstack((trainingMergedData,\
                                     bowTitleModelSub,\
                                     bowEssayModelSub));
        testMergedData = hstack((testMergedData,\
                                 bowTitleTransformedTestData,\
                                 bowEssayTransformedTestData));
    elif(index == 1):
        trainingMergedData = hstack((trainingMergedData,\
                                     tfIdfTitleModelSub,\
                                     tfIdfEssayModelSub));
        testMergedData = hstack((testMergedData,\
                                 tfIdfTitleTransformedTestData,\
                                 tfIdfEssayTransformedTestData));
    elif(index == 2):
        trainingMergedData = hstack((trainingMergedData,\
                                     word2VecTitlesVectorsSub,\
                                     word2VecEssaysVectorsSub));
        testMergedData = hstack((testMergedData,\
                                 word2VecTitleTransformedTestData,\
                                 word2VecEssayTransformedTestData));
    elif(index == 3):
        trainingMergedData = hstack((trainingMergedData,\
                                     tfIdfWeightedWord2VecTitlesVectorsSub,\
                                     tfIdfWeightedWord2VecEssaysVectorsSub));
        testMergedData = hstack((testMergedData,\
                                 tfIdfWeightedTitleTransformedTestData,\
                                 tfIdfWeightedEssayTransformedTestData));
    for testKValue in tqdm(testKValues):
        knnClassifier = KNeighborsClassifier(n_neighbors = testKValue, algorithm = 'brute');
        scores = cross_val_score(knnClassifier, trainingMergedData, classesTraining, cv = 5, scoring = 'roc_auc');
        areaUnderRocValuesTrain.append(np.array(scores).mean());
    plt.plot(testKValues, areaUnderRocValuesTrain, 'r', label = "Training K vs AUC");
    plt.title("Training Data K vs AUC - {} vectorized text".format(technique));
    plt.xlabel("Hyper parameter - K");
    plt.ylabel("Area under curve - AUC");
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.show();
    optimalKValue = testKValues[np.argmax(areaUnderRocValuesTrain)];
    knnClassifier = KNeighborsClassifier(n_neighbors = optimalKValue, algorithm = 'brute');
    knnClassifier.fit(trainingMergedData, classesTraining);
    predProbScoresTraining = knnClassifier.predict_proba(trainingMergedData);
    fprTrain, tprTrain, thresholdTrain = roc_curve(classesTraining, predProbScoresTraining[:, 1]);
    predProbScoresTest = knnClassifier.predict_proba(testMergedData);
    fprTest, tprTest, thresholdTest = roc_curve(classesTest, predProbScoresTest[:, 1]);
    areaUnderRocValueTest = auc(fprTest, tprTest);
    plt.plot(fprTrain, tprTrain, 'y', label="Train ROC curve - {} vectorized text".format(technique));
    plt.plot(fprTest, tprTest, 'g', label="Test ROC curve - {} vectorized text".format(technique));
    plt.plot([0, 1], [0, 1], 'k-');
    plt.title("ROC Curves for train and test data using k-value {}".format(optimalKValue))
    plt.xlabel('False positive rate - FPR');
    plt.ylabel('True positive rate - TPR');
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    plt.show();
    print("Results of analysis using {} vectorized text features merged with other features using K-NN brute force algorithm:".format(technique));
    equalsBorder(70);
    print("AUC values of train data: ");
    equalsBorder(40);
    print(areaUnderRocValuesTrain);
    equalsBorder(40);
    print("Optimal K-Value: ", optimalKValue);
    equalsBorder(40);
    print("AUC value of test data: ", areaUnderRocValueTest);
    # Predicting classes of test data projects
    predictionClassesTest = knnClassifier.predict(testMergedData);
    equalsBorder(40);
    # Adding results to results dataframe
    kFoldResultsDataFrame = kFoldResultsDataFrame.append({'Vectorizer': technique, 'Model': 'Brute', 'Hyper Parameter - K': optimalKValue, 'AUC': areaUnderRocValueTest}, ignore_index = True);
    # Printing confusion matrix
    confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
    # Creating dataframe for generated confusion matrix
    confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
    print("Confusion Matrix : ");
    equalsBorder(60);
    print(confusionMatrixDataFrame);
    equalsBorder(110);
    equalsBorder(110);
    equalsBorder(110);
# Merging one-hot encoded categorical vectors and standardized numerical features for train and test data
trainingMergedData = hstack((categoriesVectorsSub,
                             subCategoriesVectorsSub,
                             teacherPrefixVectorsSub,
                             schoolStateVectorsSub,
                             projectGradeVectorsSub,
                             priceStandardizedSub,
                             previouslyPostedStandardizedSub));
testMergedData = hstack((categoriesTransformedTestData,
                         subCategoriesTransformedTestData,
                         teacherPrefixTransformedTestData,
                         schoolStateTransformedTestData,
                         projectGradeTransformedTestData,
                         priceTransformedTestData,
                         previouslyPostedTransformedTestData));
# Appending the Tf-Idf vectorized title and essay text features
trainingMergedData = hstack((trainingMergedData,
                             tfIdfTitleModelSub,
                             tfIdfEssayModelSub));
testMergedData = hstack((testMergedData,
                         tfIdfTitleTransformedTestData,
                         tfIdfEssayTransformedTestData));
print("Training data shape: ", trainingMergedData.shape);
print("Test data shape: ", testMergedData.shape);
print("Classes Training shape: ", classesTraining.shape);
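The merge above stacks sparse feature blocks column-wise; a minimal sketch of `scipy.sparse.hstack` on toy matrices (shapes are illustrative, not the project data):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Two toy sparse feature blocks with the same number of rows
a = csr_matrix(np.array([[1, 0], [0, 2]]));
b = csr_matrix(np.array([[3], [4]]));

# hstack concatenates columns, so (2, 2) + (2, 1) -> (2, 3)
merged = hstack((a, b));
print(merged.shape);  # (2, 3)
```

This is why every block must share the same row count: each row stays one project, and the columns accumulate.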
# Selecting the 2000 best features ranked by ANOVA F-value (f_classif)
selectKBest = SelectKBest(f_classif, k = 2000);
filteredFeaturesTrainingMergedData = selectKBest.fit_transform(trainingMergedData, classesTraining);
filteredFeaturesTrainingMergedData.shape
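To make the `SelectKBest(f_classif, ...)` step concrete, a small sketch on toy data where only the first feature separates the classes (data and shapes are illustrative assumptions, not the project features):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data: 6 samples, 3 features; only the first feature separates the classes
X = np.array([[10, 1, 0],
              [11, 0, 1],
              [12, 1, 0],
              [ 0, 0, 1],
              [ 1, 1, 0],
              [ 2, 0, 1]], dtype=float);
y = np.array([1, 1, 1, 0, 0, 0]);

# Keep the single feature with the highest ANOVA F-value
selector = SelectKBest(f_classif, k = 1);
Xk = selector.fit_transform(X, y);
print(Xk.shape);                # (6, 1)
print(selector.get_support());  # [ True False False]
```

The same `fit_transform`/`transform` split used above matters: the selector is fit on training data only, then applied unchanged to test data.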
selectedFeaturesResultsDataFrame = pd.DataFrame(columns = ['Vectorizer', 'Model', 'Hyper Parameter - K', 'AUC']);
selectedFeaturesResultsDataFrame
# Tuning K over odd values 1..39 using 10-fold cross-validated AUC on the selected features
testKValues = np.arange(1, 40, 2);
areaUnderRocValuesTrain = [];
for testKValue in tqdm(testKValues):
    knnClassifier = KNeighborsClassifier(n_neighbors = testKValue, algorithm = 'brute');
    scores = cross_val_score(knnClassifier, filteredFeaturesTrainingMergedData, classesTraining, cv = 10, scoring = 'roc_auc');
    areaUnderRocValuesTrain.append(scores.mean());
plt.plot(testKValues, areaUnderRocValuesTrain, 'r', label = "Training K vs AUC");
plt.title("Training Data K vs AUC - {} vectorized text".format("Tf-Idf"));
plt.xlabel("Hyper parameter - K");
plt.ylabel("Area under curve - AUC");
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show();
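The inner step of the tuning loop above can be sketched in isolation on a synthetic dataset (the dataset and its size are assumptions for illustration; `cross_val_score` returns one AUC per fold):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data standing in for the project features
X, y = make_classification(n_samples = 200, n_features = 5, random_state = 0);

# One K value, 10-fold CV, AUC scoring - mirrors a single iteration of the loop
clf = KNeighborsClassifier(n_neighbors = 5, algorithm = 'brute');
scores = cross_val_score(clf, X, y, cv = 10, scoring = 'roc_auc');
print(scores.shape);   # (10,) - one AUC per fold
print(scores.mean());  # the value appended for this K
```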
optimalKValue = testKValues[np.argmax(areaUnderRocValuesTrain)];
knnClassifier = KNeighborsClassifier(n_neighbors = optimalKValue, algorithm = 'brute');
knnClassifier.fit(filteredFeaturesTrainingMergedData, classesTraining);
predProbScoresTraining = knnClassifier.predict_proba(filteredFeaturesTrainingMergedData);
fprTrain, tprTrain, thresholdTrain = roc_curve(classesTraining, predProbScoresTraining[:, 1]);
filteredFeaturesTestMergedData = selectKBest.transform(testMergedData);
predProbScoresTest = knnClassifier.predict_proba(filteredFeaturesTestMergedData);
fprTest, tprTest, thresholdTest = roc_curve(classesTest, predProbScoresTest[:, 1]);
areaUnderRocValueTest = auc(fprTest, tprTest);
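The `roc_curve`/`auc` pair used above can be checked on a tiny hand-computable case (toy labels and probabilities, not model output):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy ground truth and predicted probabilities for the positive class
labels = np.array([0, 0, 1, 1]);
scores = np.array([0.1, 0.4, 0.35, 0.8]);

# roc_curve sweeps thresholds over the scores; auc integrates TPR over FPR
fpr, tpr, thresholds = roc_curve(labels, scores);
print(auc(fpr, tpr));  # 0.75
```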
plt.plot(fprTrain, tprTrain, 'y', label="Train ROC curve - {} vectorized text".format("Tf-Idf"));
plt.plot(fprTest, tprTest, 'g', label="Test ROC curve - {} vectorized text".format("Tf-Idf"));
plt.plot([0, 1], [0, 1], 'k-');
plt.title("ROC Curves for train and test data using k-value {}".format(optimalKValue))
plt.xlabel('False positive rate - FPR');
plt.ylabel('True positive rate - TPR');
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show();
print("Results of analysis using {} vectorized text features merged with other features using K-NN brute force algorithm:".format("Tf-Idf"));
equalsBorder(70);
print("AUC values of train data: ");
equalsBorder(40);
print(areaUnderRocValuesTrain);
equalsBorder(40);
print("Optimal K-Value: ", optimalKValue);
equalsBorder(40);
print("AUC value of test data: ", areaUnderRocValueTest);
# Predicting classes of test data projects
predictionClassesTest = knnClassifier.predict(filteredFeaturesTestMergedData);
equalsBorder(40);
# Adding results to results dataframe (DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent)
selectedFeaturesResultsDataFrame = pd.concat([selectedFeaturesResultsDataFrame, pd.DataFrame([{'Vectorizer': "Tf-Idf", 'Model': 'Brute', 'Hyper Parameter - K': optimalKValue, 'AUC': areaUnderRocValueTest}])], ignore_index = True);
# Printing confusion matrix
confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
# Creating dataframe for generated confusion matrix
confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
print("Confusion Matrix : ");
equalsBorder(60);
confusionMatrixDataFrame
resultsDataFrame
kFoldResultsDataFrame
selectedFeaturesResultsDataFrame